oxdev committed on
Commit
93b6a9a
·
verified ·
1 Parent(s): 9ab390c

GRPO training complete — smart contract security auditor

README.md CHANGED
@@ -1,201 +1,68 @@
1
  ---
 
2
  library_name: transformers
 
3
  tags:
4
- - trl
5
  - grpo
6
  ---
7
 
8
- # Model Card for Model ID
9
-
10
- <!-- Provide a quick summary of what the model is/does. -->
11
-
12
-
13
-
14
- ## Model Details
15
-
16
- ### Model Description
17
-
18
- <!-- Provide a longer summary of what this model is. -->
19
-
20
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
21
-
22
- - **Developed by:** [More Information Needed]
23
- - **Funded by [optional]:** [More Information Needed]
24
- - **Shared by [optional]:** [More Information Needed]
25
- - **Model type:** [More Information Needed]
26
- - **Language(s) (NLP):** [More Information Needed]
27
- - **License:** [More Information Needed]
28
- - **Finetuned from model [optional]:** [More Information Needed]
29
-
30
- ### Model Sources [optional]
31
-
32
- <!-- Provide the basic links for the model. -->
33
-
34
- - **Repository:** [More Information Needed]
35
- - **Paper [optional]:** [More Information Needed]
36
- - **Demo [optional]:** [More Information Needed]
37
-
38
- ## Uses
39
-
40
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
41
-
42
- ### Direct Use
43
-
44
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
45
-
46
- [More Information Needed]
47
-
48
- ### Downstream Use [optional]
49
-
50
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
51
-
52
- [More Information Needed]
53
-
54
- ### Out-of-Scope Use
55
-
56
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
57
-
58
- [More Information Needed]
59
-
60
- ## Bias, Risks, and Limitations
61
-
62
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
63
-
64
- [More Information Needed]
65
-
66
- ### Recommendations
67
-
68
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
69
-
70
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
71
-
72
- ## How to Get Started with the Model
73
-
74
- Use the code below to get started with the model.
75
-
76
- [More Information Needed]
77
-
78
- ## Training Details
79
-
80
- ### Training Data
81
-
82
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
83
-
84
- [More Information Needed]
85
-
86
- ### Training Procedure
87
-
88
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
89
-
90
- #### Preprocessing [optional]
91
-
92
- [More Information Needed]
93
-
94
-
95
- #### Training Hyperparameters
96
-
97
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
98
-
99
- #### Speeds, Sizes, Times [optional]
100
-
101
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
102
-
103
- [More Information Needed]
104
-
105
- ## Evaluation
106
-
107
- <!-- This section describes the evaluation protocols and provides the results. -->
108
-
109
- ### Testing Data, Factors & Metrics
110
-
111
- #### Testing Data
112
-
113
- <!-- This should link to a Dataset Card if possible. -->
114
-
115
- [More Information Needed]
116
-
117
- #### Factors
118
-
119
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
120
-
121
- [More Information Needed]
122
-
123
- #### Metrics
124
-
125
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
126
-
127
- [More Information Needed]
128
-
129
- ### Results
130
-
131
- [More Information Needed]
132
-
133
- #### Summary
134
-
135
-
136
-
137
- ## Model Examination [optional]
138
-
139
- <!-- Relevant interpretability work for the model goes here -->
140
-
141
- [More Information Needed]
142
-
143
- ## Environmental Impact
144
-
145
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
146
-
147
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
148
-
149
- - **Hardware Type:** [More Information Needed]
150
- - **Hours used:** [More Information Needed]
151
- - **Cloud Provider:** [More Information Needed]
152
- - **Compute Region:** [More Information Needed]
153
- - **Carbon Emitted:** [More Information Needed]
154
-
155
- ## Technical Specifications [optional]
156
-
157
- ### Model Architecture and Objective
158
-
159
- [More Information Needed]
160
-
161
- ### Compute Infrastructure
162
-
163
- [More Information Needed]
164
-
165
- #### Hardware
166
-
167
- [More Information Needed]
168
-
169
- #### Software
170
-
171
- [More Information Needed]
172
 
173
- ## Citation [optional]
 
174
 
175
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
176
 
177
- **BibTeX:**
 
178
 
179
- [More Information Needed]
180
 
181
- **APA:**
182
 
183
- [More Information Needed]
184
 
185
- ## Glossary [optional]
186
 
187
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
188
 
189
- [More Information Needed]
190
 
191
- ## More Information [optional]
192
 
193
- [More Information Needed]
194
 
195
- ## Model Card Authors [optional]
196
 
197
- [More Information Needed]
198
 
199
- ## Model Card Contact
200
 
201
- [More Information Needed]
1
  ---
2
+ base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
3
  library_name: transformers
4
+ model_name: grpo_output
5
  tags:
6
+ - generated_from_trainer
7
  - grpo
8
+ - hf_jobs
9
+ - trl
10
+ licence: license
11
  ---
12
 
13
+ # Model Card for grpo_output
14
 
15
+ This model is a fine-tuned version of [Qwen/Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct).
16
+ It has been trained using [TRL](https://github.com/huggingface/trl).
17
 
18
+ ## Quick start
19
 
20
+ ```python
21
+ from transformers import pipeline
22
 
23
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
24
+ generator = pipeline("text-generation", model="None", device="cuda")
25
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
26
+ print(output["generated_text"])
27
+ ```
28
 
29
+ ## Training procedure
30
 
31
+
32
 
 
33
 
 
34
 
35
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
36
 
37
+ ### Framework versions
38
 
39
+ - TRL: 1.2.0
40
+ - Transformers: 5.6.2
41
+ - Pytorch: 2.6.0+cu126
42
+ - Datasets: 4.8.4
43
+ - Tokenizers: 0.22.2
44
 
45
+ ## Citations
46
 
47
+ Cite GRPO as:
48
 
49
+ ```bibtex
50
+ @article{shao2024deepseekmath,
51
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
52
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
53
+ year = 2024,
54
+ eprint = {arXiv:2402.03300},
55
+ }
56
+ ```
57
 
58
+ Cite TRL as:
59
+
60
+ ```bibtex
61
+ @software{vonwerra2020trl,
62
+ title = {{TRL: Transformers Reinforcement Learning}},
63
+ author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
64
+ license = {Apache-2.0},
65
+ url = {https://github.com/huggingface/trl},
66
+ year = {2020}
67
+ }
68
+ ```
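The README above names GRPO as the training method without describing it. As a rough illustration of the group-relative idea from the cited DeepSeekMath paper, the sketch below normalizes each completion's reward against the other completions sampled for the same prompt. The function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for this illustration; TRL's own implementation may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against their group.

    Hypothetical helper: `eps` and the population std are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions better than the group average get a positive advantage,
    # worse ones a negative advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed rewards produces a usable learning signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A group with identical rewards produces all-zero advantages.
print(group_relative_advantages([-0.5, -0.5, -0.5, -0.5]))
```

A group whose rewards are all identical contributes no gradient signal under this scheme, which is presumably what the `frac_reward_zero_std` field in the logged training metrics tracks.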
checkpoint-300/chat_template.jinja ADDED
@@ -0,0 +1,54 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0]['role'] == 'system' %}
4
+ {{- messages[0]['content'] }}
5
+ {%- else %}
6
+ {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
7
+ {%- endif %}
8
+ {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
9
+ {%- for tool in tools %}
10
+ {{- "\n" }}
11
+ {{- tool | tojson }}
12
+ {%- endfor %}
13
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
14
+ {%- else %}
15
+ {%- if messages[0]['role'] == 'system' %}
16
+ {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
17
+ {%- else %}
18
+ {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
19
+ {%- endif %}
20
+ {%- endif %}
21
+ {%- for message in messages %}
22
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
23
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
24
+ {%- elif message.role == "assistant" %}
25
+ {{- '<|im_start|>' + message.role }}
26
+ {%- if message.content %}
27
+ {{- '\n' + message.content }}
28
+ {%- endif %}
29
+ {%- for tool_call in message.tool_calls %}
30
+ {%- if tool_call.function is defined %}
31
+ {%- set tool_call = tool_call.function %}
32
+ {%- endif %}
33
+ {{- '\n<tool_call>\n{"name": "' }}
34
+ {{- tool_call.name }}
35
+ {{- '", "arguments": ' }}
36
+ {{- tool_call.arguments | tojson }}
37
+ {{- '}\n</tool_call>' }}
38
+ {%- endfor %}
39
+ {{- '<|im_end|>\n' }}
40
+ {%- elif message.role == "tool" %}
41
+ {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
42
+ {{- '<|im_start|>user' }}
43
+ {%- endif %}
44
+ {{- '\n<tool_response>\n' }}
45
+ {{- message.content }}
46
+ {{- '\n</tool_response>' }}
47
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
48
+ {{- '<|im_end|>\n' }}
49
+ {%- endif %}
50
+ {%- endif %}
51
+ {%- endfor %}
52
+ {%- if add_generation_prompt %}
53
+ {{- '<|im_start|>assistant\n' }}
54
+ {%- endif %}
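To make the template's behaviour concrete, here is a minimal pure-Python sketch of the path it takes when no tools are passed and no message carries tool calls. The helper name `render_chatml` is hypothetical, and the tool-related branches of the Jinja template are deliberately not covered; in practice the tokenizer's own chat-template rendering should be used.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Mirror the no-tools branch of the chat template (illustrative only)."""
    if not messages:
        raise ValueError("expected at least one message")
    # A leading system message replaces the default system prompt.
    if messages[0]["role"] == "system":
        system = messages[0]["content"]
    else:
        system = DEFAULT_SYSTEM
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for index, message in enumerate(messages):
        role = message["role"]
        if role == "system" and index == 0:
            continue  # already emitted above
        if role not in ("user", "system", "assistant") or message.get("tool_calls"):
            raise ValueError("tool messages and tool calls are not covered by this sketch")
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Audit this contract."}]))
```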
checkpoint-300/config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2ForCausalLM"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": null,
7
+ "dtype": "float32",
8
+ "eos_token_id": 151645,
9
+ "hidden_act": "silu",
10
+ "hidden_size": 896,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 4864,
13
+ "layer_types": [
14
+ "full_attention",
15
+ "full_attention",
16
+ "full_attention",
17
+ "full_attention",
18
+ "full_attention",
19
+ "full_attention",
20
+ "full_attention",
21
+ "full_attention",
22
+ "full_attention",
23
+ "full_attention",
24
+ "full_attention",
25
+ "full_attention",
26
+ "full_attention",
27
+ "full_attention",
28
+ "full_attention",
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention"
38
+ ],
39
+ "max_position_embeddings": 32768,
40
+ "max_window_layers": 24,
41
+ "model_type": "qwen2",
42
+ "num_attention_heads": 14,
43
+ "num_hidden_layers": 24,
44
+ "num_key_value_heads": 2,
45
+ "pad_token_id": 151643,
46
+ "rms_norm_eps": 1e-06,
47
+ "rope_parameters": {
48
+ "rope_theta": 1000000.0,
49
+ "rope_type": "default"
50
+ },
51
+ "sliding_window": null,
52
+ "tie_word_embeddings": true,
53
+ "transformers_version": "5.6.2",
54
+ "use_cache": false,
55
+ "use_sliding_window": false,
56
+ "vocab_size": 151936
57
+ }
checkpoint-300/generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "do_sample": true,
3
+ "eos_token_id": [
4
+ 151645,
5
+ 151643
6
+ ],
7
+ "pad_token_id": 151643,
8
+ "repetition_penalty": 1.05,
9
+ "temperature": 0.7,
10
+ "top_k": 20,
11
+ "top_p": 0.8,
12
+ "transformers_version": "5.6.2"
13
+ }
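The sampling fields in this file (`temperature`, `top_k`, `top_p`) jointly restrict which tokens can be drawn at each step. The sketch below shows one simplified way these three settings can interact on a toy logit vector. The helper `filter_probs` is hypothetical, `repetition_penalty` is left out, and the exact ordering and tie handling inside Transformers may differ.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p.

    Simplified illustration, not the Transformers implementation.
    """
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    k_total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / k_total
        if cumulative >= top_p:
            break
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}


print(filter_probs([2.0, 1.0, 0.0, -1.0]))
```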
checkpoint-300/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:275ebac6d53742a54d972d5a2fdf93a64ab774cf50af3c817e02a1376655c840
3
+ size 1976163472
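As a sanity check, the LFS pointer size can be compared with a parameter count derived from `checkpoint-300/config.json`. The sketch below assumes the usual Qwen2 layer layout (biases on the q/k/v projections only, a gated MLP with three weight matrices, two RMSNorm weights per layer plus a final one); that layout is not stated in the diff and is an assumption here.

```python
# Values copied from checkpoint-300/config.json in this commit.
hidden_size = 896
intermediate_size = 4864
num_hidden_layers = 24
num_attention_heads = 14
num_key_value_heads = 2
vocab_size = 151936

head_dim = hidden_size // num_attention_heads
kv_dim = num_key_value_heads * head_dim

# Assumed Qwen2-style attention: q/k/v carry biases, o_proj does not.
attention = hidden_size * hidden_size + hidden_size   # q_proj
attention += 2 * (hidden_size * kv_dim + kv_dim)      # k_proj and v_proj
attention += hidden_size * hidden_size                # o_proj
mlp = 3 * hidden_size * intermediate_size             # gate, up, down projections
norms = 2 * hidden_size                               # two RMSNorm weights per layer
per_layer = attention + mlp + norms

# tie_word_embeddings is true, so the embedding matrix is counted once.
embeddings = vocab_size * hidden_size
total_params = embeddings + num_hidden_layers * per_layer + hidden_size  # + final norm

weight_bytes = total_params * 4        # "dtype": "float32"
lfs_size = 1976163472                  # size from the model.safetensors pointer
overhead = lfs_size - weight_bytes

print(total_params, weight_bytes, overhead)
```

Under these assumptions the count comes to 494,032,768 parameters, or 1,976,131,072 bytes in float32, leaving 32,400 bytes that would plausibly be the safetensors header. The `optimizer.pt` pointer that follows is roughly twice the weight size, which would be consistent with an optimizer keeping two float32 state tensors per parameter, although the diff does not say which optimizer was used.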
checkpoint-300/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3577491d619a3e9b2d76cba84e6eee9cdffd5bb2784ebc6a1e3453f2ce9f8021
3
+ size 3952505274
checkpoint-300/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fa9a1789e81962b242729edabc19959b88ccde1eb3dfdbc7cd826e14f85a76f9
3
+ size 14244
checkpoint-300/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:262427cf509faa4beebbf93a0c170cf18cb00c5f988d14d843ea44ed3b3c2cae
3
+ size 1064
checkpoint-300/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
3
+ size 11421892
checkpoint-300/tokenizer_config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "backend": "tokenizers",
4
+ "bos_token": null,
5
+ "clean_up_tokenization_spaces": false,
6
+ "eos_token": "<|im_end|>",
7
+ "errors": "replace",
8
+ "extra_special_tokens": [
9
+ "<|im_start|>",
10
+ "<|im_end|>",
11
+ "<|object_ref_start|>",
12
+ "<|object_ref_end|>",
13
+ "<|box_start|>",
14
+ "<|box_end|>",
15
+ "<|quad_start|>",
16
+ "<|quad_end|>",
17
+ "<|vision_start|>",
18
+ "<|vision_end|>",
19
+ "<|vision_pad|>",
20
+ "<|image_pad|>",
21
+ "<|video_pad|>"
22
+ ],
23
+ "is_local": false,
24
+ "local_files_only": false,
25
+ "model_max_length": 32768,
26
+ "pad_token": "<|endoftext|>",
27
+ "padding_side": "left",
28
+ "split_special_tokens": false,
29
+ "tokenizer_class": "Qwen2Tokenizer",
30
+ "truncation_side": "left",
31
+ "unk_token": null
32
+ }
checkpoint-300/trainer_state.json ADDED
@@ -0,0 +1,1803 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 1.8404907975460123,
6
+ "eval_steps": 500,
7
+ "global_step": 300,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "clip_ratio/high_max": 0.0,
14
+ "clip_ratio/high_mean": 0.0,
15
+ "clip_ratio/low_mean": 0.0,
16
+ "clip_ratio/low_min": 0.0,
17
+ "clip_ratio/region_mean": 0.0,
18
+ "completions/clipped_ratio": 0.75,
19
+ "completions/max_length": 512.0,
20
+ "completions/max_terminated_length": 37.0,
21
+ "completions/mean_length": 393.25,
22
+ "completions/mean_terminated_length": 37.0,
23
+ "completions/min_length": 37.0,
24
+ "completions/min_terminated_length": 37.0,
25
+ "entropy": 1.4897738695144653,
26
+ "epoch": 0.006134969325153374,
27
+ "frac_reward_zero_std": 0.5,
28
+ "grad_norm": 2.2988293170928955,
29
+ "learning_rate": 5e-07,
30
+ "loss": -0.21252349019050598,
31
+ "num_tokens": 3567.0,
32
+ "reward": -0.3424999713897705,
33
+ "reward_std": 0.01500000525265932,
34
+ "rewards/format_reward/mean": 0.02500000037252903,
35
+ "rewards/format_reward/std": 0.05000000074505806,
36
+ "rewards/security_audit_reward/mean": -0.5,
37
+ "rewards/security_audit_reward/std": 0.0,
38
+ "step": 1,
39
+ "step_time": 39.508622552999896
40
+ },
41
+ {
42
+ "clip_ratio/high_max": 0.0,
43
+ "clip_ratio/high_mean": 0.0,
44
+ "clip_ratio/low_mean": 0.0,
45
+ "clip_ratio/low_min": 0.0,
46
+ "clip_ratio/region_mean": 0.0,
47
+ "completions/clipped_ratio": 0.5,
48
+ "completions/max_length": 512.0,
49
+ "completions/max_terminated_length": 248.75,
50
+ "completions/mean_length": 389.625,
51
+ "completions/mean_terminated_length": 192.79166793823242,
52
+ "completions/min_length": 272.75,
53
+ "completions/min_terminated_length": 144.75,
54
+ "entropy": 1.363443061709404,
55
+ "epoch": 0.03067484662576687,
56
+ "frac_reward_zero_std": 0.375,
57
+ "grad_norm": 4.688082218170166,
58
+ "learning_rate": 4.938650306748465e-07,
59
+ "loss": 0.04808004945516586,
60
+ "num_tokens": 17675.0,
61
+ "reward": -0.2981249839067459,
62
+ "reward_std": 0.08178356755524874,
63
+ "rewards/format_reward/mean": 0.10000000381842256,
64
+ "rewards/format_reward/std": 0.12774468399584293,
65
+ "rewards/security_audit_reward/mean": -0.46875,
66
+ "rewards/security_audit_reward/std": 0.0625,
67
+ "step": 5,
68
+ "step_time": 38.500043476749966
69
+ },
70
+ {
71
+ "clip_ratio/high_max": 0.0,
72
+ "clip_ratio/high_mean": 0.0,
73
+ "clip_ratio/low_mean": 0.0,
74
+ "clip_ratio/low_min": 0.0,
75
+ "clip_ratio/region_mean": 0.0,
76
+ "completions/clipped_ratio": 0.65,
77
+ "completions/max_length": 512.0,
78
+ "completions/max_terminated_length": 345.6,
79
+ "completions/mean_length": 463.0,
80
+ "completions/mean_terminated_length": 305.6,
81
+ "completions/min_length": 363.8,
82
+ "completions/min_terminated_length": 261.4,
83
+ "entropy": 1.4113845229148865,
84
+ "epoch": 0.06134969325153374,
85
+ "frac_reward_zero_std": 0.4,
86
+ "grad_norm": 3.245452880859375,
87
+ "learning_rate": 4.86196319018405e-07,
88
+ "loss": -0.00041331946849823,
89
+ "num_tokens": 37093.0,
90
+ "reward": -0.29424998760223386,
91
+ "reward_std": 0.08391451295465231,
92
+ "rewards/format_reward/mean": 0.12750000804662703,
93
+ "rewards/format_reward/std": 0.16304838731884957,
94
+ "rewards/security_audit_reward/mean": -0.475,
95
+ "rewards/security_audit_reward/std": 0.05,
96
+ "step": 10,
97
+ "step_time": 39.192330704800135
98
+ },
99
+ {
100
+ "clip_ratio/high_max": 0.0,
101
+ "clip_ratio/high_mean": 0.0,
102
+ "clip_ratio/low_mean": 0.0,
103
+ "clip_ratio/low_min": 0.0,
104
+ "clip_ratio/region_mean": 0.0,
105
+ "completions/clipped_ratio": 0.6,
106
+ "completions/max_length": 512.0,
107
+ "completions/max_terminated_length": 394.0,
108
+ "completions/mean_length": 455.5,
109
+ "completions/mean_terminated_length": 352.9,
110
+ "completions/min_length": 311.8,
111
+ "completions/min_terminated_length": 311.8,
112
+ "entropy": 1.179759132862091,
113
+ "epoch": 0.09202453987730061,
114
+ "frac_reward_zero_std": 0.7,
115
+ "grad_norm": 2.9624693393707275,
116
+ "learning_rate": 4.785276073619632e-07,
117
+ "loss": 0.03452911972999573,
118
+ "num_tokens": 55311.0,
119
+ "reward": -0.2887499898672104,
120
+ "reward_std": 0.09658594038337469,
121
+ "rewards/format_reward/mean": 0.0875,
122
+ "rewards/format_reward/std": 0.08947573080658913,
123
+ "rewards/security_audit_reward/mean": -0.45,
124
+ "rewards/security_audit_reward/std": 0.1,
125
+ "step": 15,
126
+ "step_time": 38.30515608799997
127
+ },
128
+ {
129
+ "clip_ratio/high_max": 0.0,
130
+ "clip_ratio/high_mean": 0.0,
131
+ "clip_ratio/low_mean": 0.0,
132
+ "clip_ratio/low_min": 0.0,
133
+ "clip_ratio/region_mean": 0.0,
134
+ "completions/clipped_ratio": 0.45,
135
+ "completions/max_length": 512.0,
136
+ "completions/max_terminated_length": 395.2,
137
+ "completions/mean_length": 416.9,
138
+ "completions/mean_terminated_length": 328.76666870117185,
139
+ "completions/min_length": 260.8,
140
+ "completions/min_terminated_length": 260.8,
141
+ "entropy": 1.298638153076172,
142
+ "epoch": 0.12269938650306748,
143
+ "frac_reward_zero_std": 0.2,
144
+ "grad_norm": 4.034470081329346,
145
+ "learning_rate": 4.7085889570552147e-07,
146
+ "loss": -0.008246073126792907,
147
+ "num_tokens": 72771.0,
148
+ "reward": -0.23124998807907104,
149
+ "reward_std": 0.16768747363239528,
150
+ "rewards/format_reward/mean": 0.19750000424683095,
151
+ "rewards/format_reward/std": 0.2057904489338398,
152
+ "rewards/security_audit_reward/mean": -0.4149999976158142,
153
+ "rewards/security_audit_reward/std": 0.16999999880790712,
154
+ "step": 20,
155
+ "step_time": 37.87772348239996
156
+ },
157
+ {
158
+ "clip_ratio/high_max": 0.0,
159
+ "clip_ratio/high_mean": 0.0,
160
+ "clip_ratio/low_mean": 0.0,
161
+ "clip_ratio/low_min": 0.0,
162
+ "clip_ratio/region_mean": 0.0,
163
+ "completions/clipped_ratio": 0.3,
164
+ "completions/max_length": 512.0,
165
+ "completions/max_terminated_length": 423.4,
166
+ "completions/mean_length": 382.1,
167
+ "completions/mean_terminated_length": 334.1000091552734,
168
+ "completions/min_length": 236.0,
169
+ "completions/min_terminated_length": 236.0,
170
+ "entropy": 1.317835807800293,
171
+ "epoch": 0.15337423312883436,
172
+ "frac_reward_zero_std": 0.3,
173
+ "grad_norm": 2.853423595428467,
174
+ "learning_rate": 4.631901840490797e-07,
175
+ "loss": -0.013739901781082153,
176
+ "num_tokens": 89889.0,
177
+ "reward": -0.2974999874830246,
178
+ "reward_std": 0.15671177953481674,
179
+ "rewards/format_reward/mean": 0.17500000596046447,
180
+ "rewards/format_reward/std": 0.18444484770298003,
181
+ "rewards/security_audit_reward/mean": -0.5,
182
+ "rewards/security_audit_reward/std": 0.15773502588272095,
183
+ "step": 25,
184
+ "step_time": 38.74009619139997
185
+ },
186
+ {
187
+ "clip_ratio/high_max": 0.0,
188
+ "clip_ratio/high_mean": 0.0,
189
+ "clip_ratio/low_mean": 0.0,
190
+ "clip_ratio/low_min": 0.0,
191
+ "clip_ratio/region_mean": 0.0,
192
+ "completions/clipped_ratio": 0.65,
193
+ "completions/max_length": 512.0,
194
+ "completions/max_terminated_length": 346.0,
195
+ "completions/mean_length": 463.35,
196
+ "completions/mean_terminated_length": 295.3,
197
+ "completions/min_length": 337.8,
198
+ "completions/min_terminated_length": 235.4,
199
+ "entropy": 1.1444598376750945,
200
+ "epoch": 0.18404907975460122,
201
+ "frac_reward_zero_std": 0.2,
202
+ "grad_norm": 3.64375901222229,
203
+ "learning_rate": 4.55521472392638e-07,
204
+ "loss": -0.03970654606819153,
205
+ "num_tokens": 108664.0,
206
+ "reward": -0.3184999763965607,
207
+ "reward_std": 0.04019503518939018,
208
+ "rewards/format_reward/mean": 0.10499999970197678,
209
+ "rewards/format_reward/std": 0.13398344144225122,
210
+ "rewards/security_audit_reward/mean": -0.5,
211
+ "rewards/security_audit_reward/std": 0.0,
212
+ "step": 30,
213
+ "step_time": 38.56538706479987
214
+ },
215
+ {
216
+ "clip_ratio/high_max": 0.0,
217
+ "clip_ratio/high_mean": 0.0,
218
+ "clip_ratio/low_mean": 0.0,
219
+ "clip_ratio/low_min": 0.0,
220
+ "clip_ratio/region_mean": 0.0,
221
+ "completions/clipped_ratio": 0.4,
222
+ "completions/max_length": 504.2,
223
+ "completions/max_terminated_length": 454.6,
224
+ "completions/mean_length": 421.25,
225
+ "completions/mean_terminated_length": 386.0,
226
+ "completions/min_length": 328.6,
227
+ "completions/min_terminated_length": 328.6,
228
+ "entropy": 1.3522289156913758,
229
+ "epoch": 0.2147239263803681,
230
+ "frac_reward_zero_std": 0.3,
231
+ "grad_norm": 3.4385552406311035,
232
+ "learning_rate": 4.4785276073619634e-07,
233
+ "loss": -0.06348788738250732,
234
+ "num_tokens": 126953.0,
235
+ "reward": -0.32824997901916503,
236
+ "reward_std": 0.03220053892582655,
237
+ "rewards/format_reward/mean": 0.07250000201165677,
238
+ "rewards/format_reward/std": 0.10733511671423912,
239
+ "rewards/security_audit_reward/mean": -0.5,
240
+ "rewards/security_audit_reward/std": 0.0,
241
+ "step": 35,
242
+ "step_time": 37.87626404739986
243
+ },
244
+ {
245
+ "clip_ratio/high_max": 0.0,
246
+ "clip_ratio/high_mean": 0.0,
247
+ "clip_ratio/low_mean": 0.0,
248
+ "clip_ratio/low_min": 0.0,
249
+ "clip_ratio/region_mean": 0.0,
250
+ "completions/clipped_ratio": 0.55,
251
+ "completions/max_length": 505.2,
252
+ "completions/max_terminated_length": 334.4,
253
+ "completions/mean_length": 414.9,
254
+ "completions/mean_terminated_length": 240.3,
255
+ "completions/min_length": 243.0,
256
+ "completions/min_terminated_length": 140.6,
257
+ "entropy": 1.230024951696396,
258
+ "epoch": 0.24539877300613497,
259
+ "frac_reward_zero_std": 0.3,
260
+ "grad_norm": 4.400479793548584,
261
+ "learning_rate": 4.401840490797546e-07,
262
+ "loss": 0.11927952766418456,
263
+ "num_tokens": 144785.0,
264
+ "reward": -0.2897499829530716,
265
+ "reward_std": 0.12973095811903476,
266
+ "rewards/format_reward/mean": 0.14250000044703484,
267
+ "rewards/format_reward/std": 0.14365934804081917,
268
+ "rewards/security_audit_reward/mean": -0.475,
269
+ "rewards/security_audit_reward/std": 0.13164966106414794,
270
+ "step": 40,
271
+ "step_time": 37.8069536416001
272
+ },
273
+ {
274
+ "clip_ratio/high_max": 0.0,
275
+ "clip_ratio/high_mean": 0.0,
276
+ "clip_ratio/low_mean": 0.0,
277
+ "clip_ratio/low_min": 0.0,
278
+ "clip_ratio/region_mean": 0.0,
279
+ "completions/clipped_ratio": 0.6,
280
+ "completions/max_length": 472.2,
281
+ "completions/max_terminated_length": 190.6,
282
+ "completions/mean_length": 412.7,
283
+ "completions/mean_terminated_length": 160.85,
284
+ "completions/min_length": 333.8,
285
+ "completions/min_terminated_length": 129.0,
286
+ "entropy": 1.2133947968482972,
287
+ "epoch": 0.27607361963190186,
288
+ "frac_reward_zero_std": 0.1,
289
+ "grad_norm": 4.325937271118164,
290
+ "learning_rate": 4.3251533742331285e-07,
291
+ "loss": 0.025146520137786864,
292
+ "num_tokens": 162443.0,
293
+ "reward": -0.1574999876320362,
294
+ "reward_std": 0.2636621415615082,
295
+ "rewards/format_reward/mean": 0.24500001072883607,
296
+ "rewards/format_reward/std": 0.23762110471725464,
297
+ "rewards/security_audit_reward/mean": -0.32999999523162843,
298
+ "rewards/security_audit_reward/std": 0.28574271202087403,
299
+ "step": 45,
300
+ "step_time": 34.90182834920015
301
+ },
302
+ {
303
+ "clip_ratio/high_max": 0.0,
304
+ "clip_ratio/high_mean": 0.0,
305
+ "clip_ratio/low_mean": 0.0,
306
+ "clip_ratio/low_min": 0.0,
307
+ "clip_ratio/region_mean": 0.0,
308
+ "completions/clipped_ratio": 0.55,
309
+ "completions/max_length": 512.0,
310
+ "completions/max_terminated_length": 308.6,
311
+ "completions/mean_length": 397.05,
312
+ "completions/mean_terminated_length": 259.2,
313
+ "completions/min_length": 203.0,
314
+ "completions/min_terminated_length": 203.0,
315
+ "entropy": 1.4294291973114013,
316
+ "epoch": 0.3067484662576687,
317
+ "frac_reward_zero_std": 0.4,
318
+ "grad_norm": 3.9505743980407715,
319
+ "learning_rate": 4.2484662576687116e-07,
320
+ "loss": -0.08058007955551147,
321
+ "num_tokens": 180200.0,
322
+ "reward": -0.29249998927116394,
323
+ "reward_std": 0.10127481501549482,
324
+ "rewards/format_reward/mean": 0.0750000026077032,
325
+ "rewards/format_reward/std": 0.1127780631184578,
326
+ "rewards/security_audit_reward/mean": -0.45,
327
+ "rewards/security_audit_reward/std": 0.1,
328
+ "step": 50,
329
+ "step_time": 38.750808009400046
330
+ },
331
+ {
332
+ "clip_ratio/high_max": 0.0,
333
+ "clip_ratio/high_mean": 0.0,
334
+ "clip_ratio/low_mean": 0.0,
335
+ "clip_ratio/low_min": 0.0,
336
+ "clip_ratio/region_mean": 0.0,
337
+ "completions/clipped_ratio": 0.5,
338
+ "completions/max_length": 512.0,
339
+ "completions/max_terminated_length": 389.8,
340
+ "completions/mean_length": 404.65,
341
+ "completions/mean_terminated_length": 297.43333740234374,
342
+ "completions/min_length": 192.8,
343
+ "completions/min_terminated_length": 192.8,
344
+ "entropy": 1.2564165532588958,
345
+ "epoch": 0.3374233128834356,
346
+ "frac_reward_zero_std": 0.3,
347
+ "grad_norm": 3.3762269020080566,
348
+ "learning_rate": 4.171779141104294e-07,
349
+ "loss": -0.030467823147773743,
350
+ "num_tokens": 198109.0,
351
+ "reward": -0.2542499825358391,
352
+ "reward_std": 0.07489922866225243,
353
+ "rewards/format_reward/mean": 0.20250000841915608,
354
+ "rewards/format_reward/std": 0.1368803471326828,
355
+ "rewards/security_audit_reward/mean": -0.45,
356
+ "rewards/security_audit_reward/std": 0.05773502588272095,
357
+ "step": 55,
358
+ "step_time": 38.411275500399825
359
+ },
360
+ {
361
+ "clip_ratio/high_max": 0.0,
362
+ "clip_ratio/high_mean": 0.0,
363
+ "clip_ratio/low_mean": 0.0,
364
+ "clip_ratio/low_min": 0.0,
365
+ "clip_ratio/region_mean": 0.0,
366
+ "completions/clipped_ratio": 0.35,
367
+ "completions/max_length": 496.2,
368
+ "completions/max_terminated_length": 448.8,
369
+ "completions/mean_length": 394.3,
370
+ "completions/mean_terminated_length": 358.6166687011719,
371
+ "completions/min_length": 285.4,
372
+ "completions/min_terminated_length": 285.4,
373
+ "entropy": 1.2620218694210052,
374
+ "epoch": 0.36809815950920244,
375
+ "frac_reward_zero_std": 0.2,
376
+ "grad_norm": 2.8227944374084473,
377
+ "learning_rate": 4.095092024539877e-07,
378
+ "loss": 0.039707571268081665,
379
+ "num_tokens": 215747.0,
380
+ "reward": -0.2729999750852585,
381
+ "reward_std": 0.13599938787519933,
382
+ "rewards/format_reward/mean": 0.14000000432133675,
383
+ "rewards/format_reward/std": 0.14343783408403396,
384
+ "rewards/security_audit_reward/mean": -0.45,
385
+ "rewards/security_audit_reward/std": 0.1393846869468689,
386
+ "step": 60,
387
+ "step_time": 37.50645367139987
388
+ },
389
+ {
390
+ "clip_ratio/high_max": 0.0,
391
+ "clip_ratio/high_mean": 0.0,
392
+ "clip_ratio/low_mean": 0.0,
393
+ "clip_ratio/low_min": 0.0,
394
+ "clip_ratio/region_mean": 0.0,
395
+ "completions/clipped_ratio": 0.35,
396
+ "completions/max_length": 484.8,
397
+ "completions/max_terminated_length": 334.8,
398
+ "completions/mean_length": 381.35,
399
+ "completions/mean_terminated_length": 255.98333740234375,
400
+ "completions/min_length": 275.6,
401
+ "completions/min_terminated_length": 173.2,
402
+ "entropy": 1.2798833012580872,
403
+ "epoch": 0.3987730061349693,
404
+ "frac_reward_zero_std": 0.2,
405
+ "grad_norm": 3.4819753170013428,
406
+ "learning_rate": 4.01840490797546e-07,
407
+ "loss": -0.06275686025619506,
408
+ "num_tokens": 233162.0,
409
+ "reward": -0.201749986410141,
410
+ "reward_std": 0.2016347900032997,
411
+ "rewards/format_reward/mean": 0.20250000804662704,
412
+ "rewards/format_reward/std": 0.22137173414230346,
413
+ "rewards/security_audit_reward/mean": -0.375,
414
+ "rewards/security_audit_reward/std": 0.20773502588272094,
415
+ "step": 65,
416
+ "step_time": 37.139902984000216
417
+ },
418
+ {
419
+ "clip_ratio/high_max": 0.0,
420
+ "clip_ratio/high_mean": 0.0,
421
+ "clip_ratio/low_mean": 0.0,
422
+ "clip_ratio/low_min": 0.0,
423
+ "clip_ratio/region_mean": 0.0,
424
+ "completions/clipped_ratio": 0.6,
425
+ "completions/max_length": 512.0,
426
+ "completions/max_terminated_length": 289.8,
427
+ "completions/mean_length": 408.45,
428
+ "completions/mean_terminated_length": 251.7,
429
+ "completions/min_length": 207.0,
430
+ "completions/min_terminated_length": 207.0,
431
+ "entropy": 1.2134525895118713,
432
+ "epoch": 0.4294478527607362,
433
+ "frac_reward_zero_std": 0.3,
434
+ "grad_norm": 4.019806861877441,
435
+ "learning_rate": 3.941717791411043e-07,
436
+ "loss": 0.08099154829978943,
437
+ "num_tokens": 251321.0,
438
+ "reward": -0.27599998414516447,
439
+ "reward_std": 0.08945702444761991,
440
+ "rewards/format_reward/mean": 0.1300000037997961,
441
+ "rewards/format_reward/std": 0.1645726040005684,
442
+ "rewards/security_audit_reward/mean": -0.45,
443
+ "rewards/security_audit_reward/std": 0.05773502588272095,
444
+ "step": 70,
445
+ "step_time": 38.02127088899997
446
+ },
447
+ {
448
+ "clip_ratio/high_max": 0.0,
449
+ "clip_ratio/high_mean": 0.0,
450
+ "clip_ratio/low_mean": 0.0,
451
+ "clip_ratio/low_min": 0.0,
452
+ "clip_ratio/region_mean": 0.0,
453
+ "completions/clipped_ratio": 0.5,
454
+ "completions/max_length": 459.8,
455
+ "completions/max_terminated_length": 206.2,
456
+ "completions/mean_length": 348.9,
457
+ "completions/mean_terminated_length": 176.9,
458
+ "completions/min_length": 148.6,
459
+ "completions/min_terminated_length": 148.6,
460
+ "entropy": 1.3179432690143584,
461
+ "epoch": 0.4601226993865031,
462
+ "frac_reward_zero_std": 0.0,
463
+ "grad_norm": 6.28598690032959,
464
+ "learning_rate": 3.8650306748466255e-07,
465
+ "loss": -0.11171818971633911,
466
+ "num_tokens": 267725.0,
467
+ "reward": -0.19474998638033866,
468
+ "reward_std": 0.17031802013516426,
469
+ "rewards/format_reward/mean": 0.23750000447034836,
470
+ "rewards/format_reward/std": 0.17114628925919534,
471
+ "rewards/security_audit_reward/mean": -0.3800000011920929,
472
+ "rewards/security_audit_reward/std": 0.180902099609375,
473
+ "step": 75,
474
+ "step_time": 34.450158203000136
475
+ },
476
+ {
477
+ "clip_ratio/high_max": 0.0,
478
+ "clip_ratio/high_mean": 0.0,
479
+ "clip_ratio/low_mean": 0.0,
480
+ "clip_ratio/low_min": 0.0,
481
+ "clip_ratio/region_mean": 0.0,
482
+ "completions/clipped_ratio": 0.5,
483
+ "completions/max_length": 512.0,
484
+ "completions/max_terminated_length": 332.8,
485
+ "completions/mean_length": 400.75,
486
+ "completions/mean_terminated_length": 289.06666870117186,
487
+ "completions/min_length": 251.0,
488
+ "completions/min_terminated_length": 251.0,
489
+ "entropy": 1.1514661133289337,
490
+ "epoch": 0.49079754601226994,
491
+ "frac_reward_zero_std": 0.4,
492
+ "grad_norm": 2.8479247093200684,
493
+ "learning_rate": 3.788343558282208e-07,
494
+ "loss": 0.03145935535430908,
495
+ "num_tokens": 285726.0,
496
+ "reward": -0.2569999933242798,
497
+ "reward_std": 0.15333212018013,
498
+ "rewards/format_reward/mean": 0.1350000023841858,
499
+ "rewards/format_reward/std": 0.18636635541915894,
500
+ "rewards/security_audit_reward/mean": -0.425,
501
+ "rewards/security_audit_reward/std": 0.15,
502
+ "step": 80,
503
+ "step_time": 38.869779922999626
504
+ },
505
+ {
506
+ "clip_ratio/high_max": 0.0,
507
+ "clip_ratio/high_mean": 0.0,
508
+ "clip_ratio/low_mean": 0.0,
509
+ "clip_ratio/low_min": 0.0,
510
+ "clip_ratio/region_mean": 0.0,
511
+ "completions/clipped_ratio": 0.5,
512
+ "completions/max_length": 512.0,
513
+ "completions/max_terminated_length": 321.2,
514
+ "completions/mean_length": 398.5,
515
+ "completions/mean_terminated_length": 253.0,
516
+ "completions/min_length": 183.0,
517
+ "completions/min_terminated_length": 183.0,
518
+ "entropy": 1.244500571489334,
519
+ "epoch": 0.5214723926380368,
520
+ "frac_reward_zero_std": 0.3,
521
+ "grad_norm": 2.146970272064209,
522
+ "learning_rate": 3.7116564417177916e-07,
523
+ "loss": 0.06171210408210755,
524
+ "num_tokens": 304148.0,
525
+ "reward": -0.22524999380111693,
526
+ "reward_std": 0.19191497713327407,
527
+ "rewards/format_reward/mean": 0.18250000327825547,
528
+ "rewards/format_reward/std": 0.19969657957553863,
529
+ "rewards/security_audit_reward/mean": -0.4,
530
+ "rewards/security_audit_reward/std": 0.2,
531
+ "step": 85,
532
+ "step_time": 39.297288996000134
533
+ },
534
+ {
535
+ "clip_ratio/high_max": 0.0,
536
+ "clip_ratio/high_mean": 0.0,
537
+ "clip_ratio/low_mean": 0.0,
538
+ "clip_ratio/low_min": 0.0,
539
+ "clip_ratio/region_mean": 0.0,
540
+ "completions/clipped_ratio": 0.55,
541
+ "completions/max_length": 512.0,
542
+ "completions/max_terminated_length": 360.0,
543
+ "completions/mean_length": 419.5,
544
+ "completions/mean_terminated_length": 325.6,
545
+ "completions/min_length": 291.2,
546
+ "completions/min_terminated_length": 291.2,
547
+ "entropy": 1.206581747531891,
548
+ "epoch": 0.5521472392638037,
549
+ "frac_reward_zero_std": 0.2,
550
+ "grad_norm": 2.1158456802368164,
551
+ "learning_rate": 3.634969325153374e-07,
552
+ "loss": -0.06664568185806274,
553
+ "num_tokens": 321680.0,
554
+ "reward": -0.23324998915195466,
555
+ "reward_std": 0.1919491995126009,
556
+ "rewards/format_reward/mean": 0.1675000049173832,
557
+ "rewards/format_reward/std": 0.19863576367497443,
558
+ "rewards/security_audit_reward/mean": -0.40499999523162844,
559
+ "rewards/security_audit_reward/std": 0.1899999976158142,
560
+ "step": 90,
561
+ "step_time": 38.49561594039933
562
+ },
563
+ {
564
+ "clip_ratio/high_max": 0.0,
565
+ "clip_ratio/high_mean": 0.0,
566
+ "clip_ratio/low_mean": 0.0,
567
+ "clip_ratio/low_min": 0.0,
568
+ "clip_ratio/region_mean": 0.0,
569
+ "completions/clipped_ratio": 0.4,
570
+ "completions/max_length": 469.2,
571
+ "completions/max_terminated_length": 364.0,
572
+ "completions/mean_length": 383.2,
573
+ "completions/mean_terminated_length": 291.8333374023438,
574
+ "completions/min_length": 217.6,
575
+ "completions/min_terminated_length": 217.6,
576
+ "entropy": 1.217250692844391,
577
+ "epoch": 0.5828220858895705,
578
+ "frac_reward_zero_std": 0.4,
579
+ "grad_norm": 4.098232269287109,
580
+ "learning_rate": 3.558282208588957e-07,
581
+ "loss": 0.05211906433105469,
582
+ "num_tokens": 339350.0,
583
+ "reward": -0.2119999945163727,
584
+ "reward_std": 0.19894140996038914,
585
+ "rewards/format_reward/mean": 0.1799999989569187,
586
+ "rewards/format_reward/std": 0.23350853994488716,
587
+ "rewards/security_audit_reward/mean": -0.37999999821186065,
588
+ "rewards/security_audit_reward/std": 0.1911805212497711,
589
+ "step": 95,
590
+ "step_time": 35.83687614579994
591
+ },
592
+ {
593
+ "clip_ratio/high_max": 0.0,
594
+ "clip_ratio/high_mean": 0.0,
595
+ "clip_ratio/low_mean": 0.0,
596
+ "clip_ratio/low_min": 0.0,
597
+ "clip_ratio/region_mean": 0.0,
598
+ "completions/clipped_ratio": 0.5,
599
+ "completions/max_length": 512.0,
600
+ "completions/max_terminated_length": 345.4,
601
+ "completions/mean_length": 377.45,
602
+ "completions/mean_terminated_length": 290.6333343505859,
603
+ "completions/min_length": 240.4,
604
+ "completions/min_terminated_length": 240.4,
605
+ "entropy": 1.2783292949199676,
606
+ "epoch": 0.6134969325153374,
607
+ "frac_reward_zero_std": 0.3,
608
+ "grad_norm": 2.361516237258911,
609
+ "learning_rate": 3.48159509202454e-07,
610
+ "loss": 0.06258203387260437,
611
+ "num_tokens": 356239.0,
612
+ "reward": -0.20649999231100083,
613
+ "reward_std": 0.18195689767599105,
614
+ "rewards/format_reward/mean": 0.24499999433755876,
615
+ "rewards/format_reward/std": 0.19310407042503358,
616
+ "rewards/security_audit_reward/mean": -0.4,
617
+ "rewards/security_audit_reward/std": 0.2,
618
+ "step": 100,
619
+ "step_time": 38.47846096040011
620
+ },
621
+ {
622
+ "clip_ratio/high_max": 0.0,
623
+ "clip_ratio/high_mean": 0.0,
624
+ "clip_ratio/low_mean": 0.0,
625
+ "clip_ratio/low_min": 0.0,
626
+ "clip_ratio/region_mean": 0.0,
627
+ "completions/clipped_ratio": 0.3,
628
+ "completions/max_length": 463.2,
629
+ "completions/max_terminated_length": 394.6,
630
+ "completions/mean_length": 335.8,
631
+ "completions/mean_terminated_length": 268.3,
632
+ "completions/min_length": 139.6,
633
+ "completions/min_terminated_length": 139.6,
634
+ "entropy": 1.2529696226119995,
635
+ "epoch": 0.6441717791411042,
636
+ "frac_reward_zero_std": 0.2,
637
+ "grad_norm": 3.356074094772339,
638
+ "learning_rate": 3.4049079754601224e-07,
639
+ "loss": 0.003340443968772888,
640
+ "num_tokens": 373237.0,
641
+ "reward": -0.2567499876022339,
642
+ "reward_std": 0.27417250275611876,
643
+ "rewards/format_reward/mean": 0.14750000461935997,
644
+ "rewards/format_reward/std": 0.19759280756115913,
645
+ "rewards/security_audit_reward/mean": -0.4299999952316284,
646
+ "rewards/security_audit_reward/std": 0.3186576545238495,
647
+ "step": 105,
648
+ "step_time": 35.43083410320014
649
+ },
650
+ {
651
+ "clip_ratio/high_max": 0.0,
652
+ "clip_ratio/high_mean": 0.0,
653
+ "clip_ratio/low_mean": 0.0,
654
+ "clip_ratio/low_min": 0.0,
655
+ "clip_ratio/region_mean": 0.0,
656
+ "completions/clipped_ratio": 0.45,
657
+ "completions/max_length": 512.0,
658
+ "completions/max_terminated_length": 372.8,
659
+ "completions/mean_length": 419.15,
660
+ "completions/mean_terminated_length": 319.2666748046875,
661
+ "completions/min_length": 266.8,
662
+ "completions/min_terminated_length": 266.8,
663
+ "entropy": 1.1685741186141967,
664
+ "epoch": 0.6748466257668712,
665
+ "frac_reward_zero_std": 0.2,
666
+ "grad_norm": 4.318619728088379,
667
+ "learning_rate": 3.3282208588957055e-07,
668
+ "loss": -0.026089027523994446,
669
+ "num_tokens": 391784.0,
670
+ "reward": -0.2662499874830246,
671
+ "reward_std": 0.07884115856140853,
672
+ "rewards/format_reward/mean": 0.16250000558793545,
673
+ "rewards/format_reward/std": 0.14070439487695693,
674
+ "rewards/security_audit_reward/mean": -0.45,
675
+ "rewards/security_audit_reward/std": 0.05773502588272095,
676
+ "step": 110,
677
+ "step_time": 38.89281254739999
678
+ },
679
+ {
680
+ "clip_ratio/high_max": 0.0,
681
+ "clip_ratio/high_mean": 0.0,
682
+ "clip_ratio/low_mean": 0.0,
683
+ "clip_ratio/low_min": 0.0,
684
+ "clip_ratio/region_mean": 0.0,
685
+ "completions/clipped_ratio": 0.3,
686
+ "completions/max_length": 468.0,
687
+ "completions/max_terminated_length": 337.6,
688
+ "completions/mean_length": 330.1,
689
+ "completions/mean_terminated_length": 240.4166687011719,
690
+ "completions/min_length": 160.8,
691
+ "completions/min_terminated_length": 160.8,
692
+ "entropy": 1.2954379856586455,
693
+ "epoch": 0.7055214723926381,
694
+ "frac_reward_zero_std": 0.2,
695
+ "grad_norm": 3.293928384780884,
696
+ "learning_rate": 3.251533742331288e-07,
697
+ "loss": 0.17276796102523803,
698
+ "num_tokens": 408446.0,
699
+ "reward": -0.22849999666213988,
700
+ "reward_std": 0.1390242099761963,
701
+ "rewards/format_reward/mean": 0.2300000011920929,
702
+ "rewards/format_reward/std": 0.23302415013313293,
703
+ "rewards/security_audit_reward/mean": -0.425,
704
+ "rewards/security_audit_reward/std": 0.10773502588272095,
705
+ "step": 115,
706
+ "step_time": 35.73082293679981
707
+ },
708
+ {
709
+ "clip_ratio/high_max": 0.0,
710
+ "clip_ratio/high_mean": 0.0,
711
+ "clip_ratio/low_mean": 0.0,
712
+ "clip_ratio/low_min": 0.0,
713
+ "clip_ratio/region_mean": 0.0,
714
+ "completions/clipped_ratio": 0.55,
715
+ "completions/max_length": 512.0,
716
+ "completions/max_terminated_length": 281.2,
717
+ "completions/mean_length": 400.5,
718
+ "completions/mean_terminated_length": 248.73333740234375,
719
+ "completions/min_length": 211.8,
720
+ "completions/min_terminated_length": 211.8,
721
+ "entropy": 1.2283548831939697,
722
+ "epoch": 0.7361963190184049,
723
+ "frac_reward_zero_std": 0.2,
724
+ "grad_norm": 2.632479190826416,
725
+ "learning_rate": 3.174846625766871e-07,
726
+ "loss": 0.05111231803894043,
727
+ "num_tokens": 426822.0,
728
+ "reward": -0.22074998915195465,
729
+ "reward_std": 0.15957241374999284,
730
+ "rewards/format_reward/mean": 0.1975000012665987,
731
+ "rewards/format_reward/std": 0.1847505249083042,
732
+ "rewards/security_audit_reward/mean": -0.4,
733
+ "rewards/security_audit_reward/std": 0.15773502588272095,
734
+ "step": 120,
735
+ "step_time": 39.14503100519996
736
+ },
737
+ {
738
+ "clip_ratio/high_max": 0.0,
739
+ "clip_ratio/high_mean": 0.0,
740
+ "clip_ratio/low_mean": 0.0,
741
+ "clip_ratio/low_min": 0.0,
742
+ "clip_ratio/region_mean": 0.0,
743
+ "completions/clipped_ratio": 0.5,
744
+ "completions/max_length": 512.0,
745
+ "completions/max_terminated_length": 340.2,
746
+ "completions/mean_length": 401.4,
747
+ "completions/mean_terminated_length": 236.03333740234376,
748
+ "completions/min_length": 245.2,
749
+ "completions/min_terminated_length": 142.8,
750
+ "entropy": 1.307636547088623,
751
+ "epoch": 0.7668711656441718,
752
+ "frac_reward_zero_std": 0.3,
753
+ "grad_norm": 5.566491603851318,
754
+ "learning_rate": 3.0981595092024537e-07,
755
+ "loss": 0.003215853124856949,
756
+ "num_tokens": 444322.0,
757
+ "reward": -0.11199999079108239,
758
+ "reward_std": 0.2506739288568497,
759
+ "rewards/format_reward/mean": 0.2449999988079071,
760
+ "rewards/format_reward/std": 0.20622505843639374,
761
+ "rewards/security_audit_reward/mean": -0.26500000059604645,
762
+ "rewards/security_audit_reward/std": 0.278915548324585,
763
+ "step": 125,
764
+ "step_time": 38.70229864360026
765
+ },
766
+ {
767
+ "clip_ratio/high_max": 0.0,
768
+ "clip_ratio/high_mean": 0.0,
769
+ "clip_ratio/low_mean": 0.0,
770
+ "clip_ratio/low_min": 0.0,
771
+ "clip_ratio/region_mean": 0.0,
772
+ "completions/clipped_ratio": 0.35,
773
+ "completions/max_length": 512.0,
774
+ "completions/max_terminated_length": 390.6,
775
+ "completions/mean_length": 367.0,
776
+ "completions/mean_terminated_length": 264.0000061035156,
777
+ "completions/min_length": 123.2,
778
+ "completions/min_terminated_length": 123.2,
779
+ "entropy": 1.248900693655014,
780
+ "epoch": 0.7975460122699386,
781
+ "frac_reward_zero_std": 0.2,
782
+ "grad_norm": 4.395384311676025,
783
+ "learning_rate": 3.021472392638036e-07,
784
+ "loss": 0.06482647061347961,
785
+ "num_tokens": 461894.0,
786
+ "reward": -0.2042499899864197,
787
+ "reward_std": 0.17003463432192803,
788
+ "rewards/format_reward/mean": 0.2525000125169754,
789
+ "rewards/format_reward/std": 0.21560870110988617,
790
+ "rewards/security_audit_reward/mean": -0.4,
791
+ "rewards/security_audit_reward/std": 0.15773502588272095,
792
+ "step": 130,
793
+ "step_time": 39.174271353800215
794
+ },
795
+ {
796
+ "clip_ratio/high_max": 0.0,
797
+ "clip_ratio/high_mean": 0.0,
798
+ "clip_ratio/low_mean": 0.0,
799
+ "clip_ratio/low_min": 0.0,
800
+ "clip_ratio/region_mean": 0.0,
801
+ "completions/clipped_ratio": 0.35,
802
+ "completions/max_length": 512.0,
803
+ "completions/max_terminated_length": 370.0,
804
+ "completions/mean_length": 359.95,
805
+ "completions/mean_terminated_length": 276.9666687011719,
806
+ "completions/min_length": 203.8,
807
+ "completions/min_terminated_length": 203.8,
808
+ "entropy": 1.3299469709396363,
809
+ "epoch": 0.8282208588957055,
810
+ "frac_reward_zero_std": 0.0,
811
+ "grad_norm": 4.3037519454956055,
812
+ "learning_rate": 2.94478527607362e-07,
813
+ "loss": 0.026153716444969177,
814
+ "num_tokens": 478783.0,
815
+ "reward": -0.19949999153614045,
816
+ "reward_std": 0.15296672135591508,
817
+ "rewards/format_reward/mean": 0.24500000178813935,
818
+ "rewards/format_reward/std": 0.2273508906364441,
819
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
820
+ "rewards/security_audit_reward/std": 0.12891554832458496,
821
+ "step": 135,
822
+ "step_time": 38.658669441000164
823
+ },
824
+ {
825
+ "clip_ratio/high_max": 0.0,
826
+ "clip_ratio/high_mean": 0.0,
827
+ "clip_ratio/low_mean": 0.0,
828
+ "clip_ratio/low_min": 0.0,
829
+ "clip_ratio/region_mean": 0.0,
830
+ "completions/clipped_ratio": 0.45,
831
+ "completions/max_length": 512.0,
832
+ "completions/max_terminated_length": 354.0,
833
+ "completions/mean_length": 380.3,
834
+ "completions/mean_terminated_length": 271.6666687011719,
835
+ "completions/min_length": 206.4,
836
+ "completions/min_terminated_length": 206.4,
837
+ "entropy": 1.0997539341449738,
838
+ "epoch": 0.8588957055214724,
839
+ "frac_reward_zero_std": 0.2,
840
+ "grad_norm": 2.3693976402282715,
841
+ "learning_rate": 2.8680981595092024e-07,
842
+ "loss": -0.01876506209373474,
843
+ "num_tokens": 496243.0,
844
+ "reward": -0.17974998727440833,
845
+ "reward_std": 0.18710523881018162,
846
+ "rewards/format_reward/mean": 0.21750000193715097,
847
+ "rewards/format_reward/std": 0.1843859799206257,
848
+ "rewards/security_audit_reward/mean": -0.35,
849
+ "rewards/security_audit_reward/std": 0.20347774028778076,
850
+ "step": 140,
851
+ "step_time": 39.171343391999834
852
+ },
853
+ {
854
+ "clip_ratio/high_max": 0.0,
855
+ "clip_ratio/high_mean": 0.0,
856
+ "clip_ratio/low_mean": 0.0,
857
+ "clip_ratio/low_min": 0.0,
858
+ "clip_ratio/region_mean": 0.0,
859
+ "completions/clipped_ratio": 0.35,
860
+ "completions/max_length": 456.2,
861
+ "completions/max_terminated_length": 355.2,
862
+ "completions/mean_length": 348.35,
863
+ "completions/mean_terminated_length": 270.3,
864
+ "completions/min_length": 164.8,
865
+ "completions/min_terminated_length": 164.8,
866
+ "entropy": 1.2017314374446868,
867
+ "epoch": 0.8895705521472392,
868
+ "frac_reward_zero_std": 0.3,
869
+ "grad_norm": 4.230531215667725,
870
+ "learning_rate": 2.791411042944785e-07,
871
+ "loss": 0.0029310762882232668,
872
+ "num_tokens": 513422.0,
873
+ "reward": -0.1637499898672104,
874
+ "reward_std": 0.2463478922843933,
875
+ "rewards/format_reward/mean": 0.2475000023841858,
876
+ "rewards/format_reward/std": 0.2085829883813858,
877
+ "rewards/security_audit_reward/mean": -0.34000000059604646,
878
+ "rewards/security_audit_reward/std": 0.26830023527145386,
879
+ "step": 145,
880
+ "step_time": 34.95672115479992
881
+ },
882
+ {
883
+ "clip_ratio/high_max": 0.0,
884
+ "clip_ratio/high_mean": 0.0,
885
+ "clip_ratio/low_mean": 0.0,
886
+ "clip_ratio/low_min": 0.0,
887
+ "clip_ratio/region_mean": 0.0,
888
+ "completions/clipped_ratio": 0.25,
889
+ "completions/max_length": 482.4,
890
+ "completions/max_terminated_length": 431.6,
891
+ "completions/mean_length": 355.35,
892
+ "completions/mean_terminated_length": 319.4166687011719,
893
+ "completions/min_length": 212.0,
894
+ "completions/min_terminated_length": 212.0,
895
+ "entropy": 1.258862280845642,
896
+ "epoch": 0.9202453987730062,
897
+ "frac_reward_zero_std": 0.2,
898
+ "grad_norm": 6.071740627288818,
899
+ "learning_rate": 2.714723926380368e-07,
900
+ "loss": 0.0822126567363739,
901
+ "num_tokens": 530643.0,
902
+ "reward": -0.1912499874830246,
903
+ "reward_std": 0.1670845106244087,
904
+ "rewards/format_reward/mean": 0.27249999940395353,
905
+ "rewards/format_reward/std": 0.1733592666685581,
906
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
907
+ "rewards/security_audit_reward/std": 0.17118052244186402,
908
+ "step": 150,
909
+ "step_time": 37.038782767599876
910
+ },
911
+ {
912
+ "clip_ratio/high_max": 0.0,
913
+ "clip_ratio/high_mean": 0.0,
914
+ "clip_ratio/low_mean": 0.0,
915
+ "clip_ratio/low_min": 0.0,
916
+ "clip_ratio/region_mean": 0.0,
917
+ "completions/clipped_ratio": 0.2,
918
+ "completions/max_length": 442.2,
919
+ "completions/max_terminated_length": 316.0,
920
+ "completions/mean_length": 283.75,
921
+ "completions/mean_terminated_length": 237.85000305175782,
922
+ "completions/min_length": 166.2,
923
+ "completions/min_terminated_length": 166.2,
924
+ "entropy": 1.4853489220142364,
925
+ "epoch": 0.950920245398773,
926
+ "frac_reward_zero_std": 0.0,
927
+ "grad_norm": 5.2570695877075195,
928
+ "learning_rate": 2.6380368098159506e-07,
929
+ "loss": 0.11004064083099366,
930
+ "num_tokens": 545966.0,
931
+ "reward": -0.15949999541044235,
932
+ "reward_std": 0.19192611873149873,
933
+ "rewards/format_reward/mean": 0.28500000238418577,
934
+ "rewards/format_reward/std": 0.22434256076812745,
935
+ "rewards/security_audit_reward/mean": -0.35,
936
+ "rewards/security_audit_reward/std": 0.2,
937
+ "step": 155,
938
+ "step_time": 33.86747411140077
939
+ },
940
+ {
941
+ "clip_ratio/high_max": 0.0,
942
+ "clip_ratio/high_mean": 0.0,
943
+ "clip_ratio/low_mean": 0.0,
944
+ "clip_ratio/low_min": 0.0,
945
+ "clip_ratio/region_mean": 0.0,
946
+ "completions/clipped_ratio": 0.55,
947
+ "completions/max_length": 512.0,
948
+ "completions/max_terminated_length": 340.8,
949
+ "completions/mean_length": 411.45,
950
+ "completions/mean_terminated_length": 265.9666687011719,
951
+ "completions/min_length": 192.0,
952
+ "completions/min_terminated_length": 192.0,
953
+ "entropy": 1.0828768193721772,
954
+ "epoch": 0.9815950920245399,
955
+ "frac_reward_zero_std": 0.2,
956
+ "grad_norm": 2.6537587642669678,
957
+ "learning_rate": 2.5613496932515337e-07,
958
+ "loss": 0.03556116819381714,
959
+ "num_tokens": 563683.0,
960
+ "reward": -0.20099999010562897,
961
+ "reward_std": 0.1888158166781068,
962
+ "rewards/format_reward/mean": 0.24000000059604645,
963
+ "rewards/format_reward/std": 0.16870398968458175,
964
+ "rewards/security_audit_reward/mean": -0.3899999976158142,
965
+ "rewards/security_audit_reward/std": 0.2199999988079071,
966
+ "step": 160,
967
+ "step_time": 38.665304075799575
968
+ },
969
+ {
970
+ "clip_ratio/high_max": 0.0,
971
+ "clip_ratio/high_mean": 0.0,
972
+ "clip_ratio/low_mean": 0.0,
973
+ "clip_ratio/low_min": 0.0,
974
+ "clip_ratio/region_mean": 0.0,
975
+ "completions/clipped_ratio": 0.3,
976
+ "completions/max_length": 483.2,
977
+ "completions/max_terminated_length": 393.2,
978
+ "completions/mean_length": 359.2,
979
+ "completions/mean_terminated_length": 303.3666687011719,
980
+ "completions/min_length": 178.2,
981
+ "completions/min_terminated_length": 178.2,
982
+ "entropy": 1.1811485469341279,
983
+ "epoch": 1.0122699386503067,
984
+ "frac_reward_zero_std": 0.0,
985
+ "grad_norm": 4.203860282897949,
986
+ "learning_rate": 2.4846625766871163e-07,
987
+ "loss": -0.02532302737236023,
988
+ "num_tokens": 580183.0,
989
+ "reward": -0.1227499857544899,
990
+ "reward_std": 0.2651766210794449,
991
+ "rewards/format_reward/mean": 0.2675000011920929,
992
+ "rewards/format_reward/std": 0.24115291833877564,
993
+ "rewards/security_audit_reward/mean": -0.2899999976158142,
994
+ "rewards/security_audit_reward/std": 0.2812127649784088,
995
+ "step": 165,
996
+ "step_time": 36.34888075860035
997
+ },
998
+ {
999
+ "clip_ratio/high_max": 0.0,
1000
+ "clip_ratio/high_mean": 0.0,
1001
+ "clip_ratio/low_mean": 0.0,
1002
+ "clip_ratio/low_min": 0.0,
1003
+ "clip_ratio/region_mean": 0.0,
1004
+ "completions/clipped_ratio": 0.4,
1005
+ "completions/max_length": 512.0,
1006
+ "completions/max_terminated_length": 345.8,
1007
+ "completions/mean_length": 358.5,
1008
+ "completions/mean_terminated_length": 235.96666870117187,
1009
+ "completions/min_length": 123.0,
1010
+ "completions/min_terminated_length": 123.0,
1011
+ "entropy": 1.2863860994577407,
1012
+ "epoch": 1.0429447852760736,
1013
+ "frac_reward_zero_std": 0.2,
1014
+ "grad_norm": 2.71185302734375,
1015
+ "learning_rate": 2.4079754601226994e-07,
1016
+ "loss": 0.12254136800765991,
1017
+ "num_tokens": 597345.0,
1018
+ "reward": -0.20274999886751174,
1019
+ "reward_std": 0.1825057201087475,
1020
+ "rewards/format_reward/mean": 0.25750000327825545,
1021
+ "rewards/format_reward/std": 0.19183385372161865,
1022
+ "rewards/security_audit_reward/mean": -0.4,
1023
+ "rewards/security_audit_reward/std": 0.19711971282958984,
1024
+ "step": 170,
1025
+ "step_time": 38.96664929399922
1026
+ },
1027
+ {
1028
+ "clip_ratio/high_max": 0.0,
1029
+ "clip_ratio/high_mean": 0.0,
1030
+ "clip_ratio/low_mean": 0.0,
1031
+ "clip_ratio/low_min": 0.0,
1032
+ "clip_ratio/region_mean": 0.0,
1033
+ "completions/clipped_ratio": 0.5,
1034
+ "completions/max_length": 512.0,
1035
+ "completions/max_terminated_length": 323.4,
1036
+ "completions/mean_length": 407.1,
1037
+ "completions/mean_terminated_length": 234.40000610351564,
1038
+ "completions/min_length": 242.8,
1039
+ "completions/min_terminated_length": 140.4,
1040
+ "entropy": 1.179810070991516,
1041
+ "epoch": 1.0736196319018405,
1042
+ "frac_reward_zero_std": 0.2,
1043
+ "grad_norm": 4.413055419921875,
1044
+ "learning_rate": 2.331288343558282e-07,
1045
+ "loss": 0.041037318110466,
1046
+ "num_tokens": 615063.0,
1047
+ "reward": -0.20374999046325684,
1048
+ "reward_std": 0.2052689865231514,
1049
+ "rewards/format_reward/mean": 0.31250000894069674,
1050
+ "rewards/format_reward/std": 0.22553626000881194,
1051
+ "rewards/security_audit_reward/mean": -0.425,
1052
+ "rewards/security_audit_reward/std": 0.23164966106414794,
1053
+ "step": 175,
1054
+ "step_time": 39.00492364500023
1055
+ },
1056
+ {
1057
+ "clip_ratio/high_max": 0.0,
1058
+ "clip_ratio/high_mean": 0.0,
1059
+ "clip_ratio/low_mean": 0.0,
1060
+ "clip_ratio/low_min": 0.0,
1061
+ "clip_ratio/region_mean": 0.0,
1062
+ "completions/clipped_ratio": 0.3,
1063
+ "completions/max_length": 511.6,
1064
+ "completions/max_terminated_length": 424.0,
1065
+ "completions/mean_length": 399.45,
1066
+ "completions/mean_terminated_length": 347.9166748046875,
1067
+ "completions/min_length": 262.6,
1068
+ "completions/min_terminated_length": 262.6,
1069
+ "entropy": 1.1026120364665986,
1070
+ "epoch": 1.1042944785276074,
1071
+ "frac_reward_zero_std": 0.0,
1072
+ "grad_norm": 4.408414840698242,
1073
+ "learning_rate": 2.254601226993865e-07,
1074
+ "loss": 0.058766734600067136,
1075
+ "num_tokens": 632696.0,
1076
+ "reward": -0.16599998623132706,
1077
+ "reward_std": 0.26377752125263215,
1078
+ "rewards/format_reward/mean": 0.24000000655651094,
1079
+ "rewards/format_reward/std": 0.21778101623058319,
1080
+ "rewards/security_audit_reward/mean": -0.34000000059604646,
1081
+ "rewards/security_audit_reward/std": 0.3105652093887329,
1082
+ "step": 180,
1083
+ "step_time": 39.09616019519963
1084
+ },
1085
+ {
1086
+ "clip_ratio/high_max": 0.0,
1087
+ "clip_ratio/high_mean": 0.0,
1088
+ "clip_ratio/low_mean": 0.0,
1089
+ "clip_ratio/low_min": 0.0,
1090
+ "clip_ratio/region_mean": 0.0,
1091
+ "completions/clipped_ratio": 0.2,
1092
+ "completions/max_length": 477.4,
1093
+ "completions/max_terminated_length": 421.6,
1094
+ "completions/mean_length": 322.45,
1095
+ "completions/mean_terminated_length": 273.1000030517578,
1096
+ "completions/min_length": 156.0,
1097
+ "completions/min_terminated_length": 156.0,
1098
+ "entropy": 1.2888785600662231,
1099
+ "epoch": 1.1349693251533743,
1100
+ "frac_reward_zero_std": 0.0,
1101
+ "grad_norm": 3.6877431869506836,
1102
+ "learning_rate": 2.1779141104294476e-07,
1103
+ "loss": -0.0771723210811615,
1104
+ "num_tokens": 649353.0,
1105
+ "reward": -0.1799999952316284,
1106
+ "reward_std": 0.3134476348757744,
1107
+ "rewards/format_reward/mean": 0.2750000089406967,
1108
+ "rewards/format_reward/std": 0.24135999679565429,
1109
+ "rewards/security_audit_reward/mean": -0.375,
1110
+ "rewards/security_audit_reward/std": 0.3593961834907532,
1111
+ "step": 185,
1112
+ "step_time": 36.71157897000012
1113
+ },
1114
+ {
1115
+ "clip_ratio/high_max": 0.0,
1116
+ "clip_ratio/high_mean": 0.0,
1117
+ "clip_ratio/low_mean": 0.0,
1118
+ "clip_ratio/low_min": 0.0,
1119
+ "clip_ratio/region_mean": 0.0,
1120
+ "completions/clipped_ratio": 0.3,
1121
+ "completions/max_length": 495.0,
1122
+ "completions/max_terminated_length": 351.0,
1123
+ "completions/mean_length": 328.8,
1124
+ "completions/mean_terminated_length": 243.7,
1125
+ "completions/min_length": 164.8,
1126
+ "completions/min_terminated_length": 164.8,
1127
+ "entropy": 1.4585140287876128,
1128
+ "epoch": 1.165644171779141,
1129
+ "frac_reward_zero_std": 0.2,
1130
+ "grad_norm": 4.987306118011475,
1131
+ "learning_rate": 2.1012269938650307e-07,
1132
+ "loss": -0.15080010890960693,
1133
+ "num_tokens": 665513.0,
1134
+ "reward": -0.050499990582466125,
1135
+ "reward_std": 0.2684710592031479,
1136
+ "rewards/format_reward/mean": 0.31000000387430193,
1137
+ "rewards/format_reward/std": 0.1994625985622406,
1138
+ "rewards/security_audit_reward/mean": -0.20500000119209288,
1139
+ "rewards/security_audit_reward/std": 0.31255176067352297,
1140
+ "step": 190,
1141
+ "step_time": 37.63543628939988
1142
+ },
1143
+ {
1144
+ "clip_ratio/high_max": 0.0,
1145
+ "clip_ratio/high_mean": 0.0,
1146
+ "clip_ratio/low_mean": 0.0,
1147
+ "clip_ratio/low_min": 0.0,
1148
+ "clip_ratio/region_mean": 0.0,
1149
+ "completions/clipped_ratio": 0.5,
1150
+ "completions/max_length": 512.0,
1151
+ "completions/max_terminated_length": 307.8,
1152
+ "completions/mean_length": 391.75,
1153
+ "completions/mean_terminated_length": 253.06666870117186,
1154
+ "completions/min_length": 178.0,
1155
+ "completions/min_terminated_length": 178.0,
1156
+ "entropy": 1.1293343544006347,
1157
+ "epoch": 1.196319018404908,
1158
+ "frac_reward_zero_std": 0.1,
1159
+ "grad_norm": 5.247244358062744,
1160
+ "learning_rate": 2.0245398773006135e-07,
1161
+ "loss": -0.04229157567024231,
1162
+ "num_tokens": 683268.0,
1163
+ "reward": -0.10224998965859414,
1164
+ "reward_std": 0.19266743455082178,
1165
+ "rewards/format_reward/mean": 0.3124999929219484,
1166
+ "rewards/format_reward/std": 0.1390557773411274,
1167
+ "rewards/security_audit_reward/mean": -0.2800000011920929,
1168
+ "rewards/security_audit_reward/std": 0.23863712549209595,
1169
+ "step": 195,
1170
+ "step_time": 38.93936442300037
1171
+ },
1172
+ {
1173
+ "clip_ratio/high_max": 0.0,
1174
+ "clip_ratio/high_mean": 0.0,
1175
+ "clip_ratio/low_mean": 0.0,
1176
+ "clip_ratio/low_min": 0.0,
1177
+ "clip_ratio/region_mean": 0.0,
1178
+ "completions/clipped_ratio": 0.4,
1179
+ "completions/max_length": 485.0,
1180
+ "completions/max_terminated_length": 346.2,
1181
+ "completions/mean_length": 364.6,
1182
+ "completions/mean_terminated_length": 265.8166748046875,
1183
+ "completions/min_length": 165.8,
1184
+ "completions/min_terminated_length": 165.8,
1185
+ "entropy": 0.8287177711725235,
1186
+ "epoch": 1.2269938650306749,
1187
+ "frac_reward_zero_std": 0.1,
1188
+ "grad_norm": 2.4230945110321045,
1189
+ "learning_rate": 1.9478527607361963e-07,
1190
+ "loss": -0.05633368492126465,
1191
+ "num_tokens": 700760.0,
1192
+ "reward": -0.1807499848306179,
1193
+ "reward_std": 0.18529897555708885,
1194
+ "rewards/format_reward/mean": 0.3075000137090683,
1195
+ "rewards/format_reward/std": 0.15467575192451477,
1196
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
1197
+ "rewards/security_audit_reward/std": 0.2105652093887329,
1198
+ "step": 200,
1199
+ "step_time": 37.065423558799736
1200
+ },
1201
+ {
1202
+ "clip_ratio/high_max": 0.0,
1203
+ "clip_ratio/high_mean": 0.0,
1204
+ "clip_ratio/low_mean": 0.0,
1205
+ "clip_ratio/low_min": 0.0,
1206
+ "clip_ratio/region_mean": 0.0,
1207
+ "completions/clipped_ratio": 0.25,
1208
+ "completions/max_length": 431.6,
1209
+ "completions/max_terminated_length": 338.8,
1210
+ "completions/mean_length": 284.1,
1211
+ "completions/mean_terminated_length": 225.65,
1212
+ "completions/min_length": 119.4,
1213
+ "completions/min_terminated_length": 119.4,
1214
+ "entropy": 1.2736368715763091,
1215
+ "epoch": 1.2576687116564418,
1216
+ "frac_reward_zero_std": 0.1,
1217
+ "grad_norm": 4.797567367553711,
1218
+ "learning_rate": 1.8711656441717791e-07,
1219
+ "loss": 0.08297693133354186,
1220
+ "num_tokens": 716344.0,
1221
+ "reward": -0.07274999544024467,
1222
+ "reward_std": 0.24350565671920776,
1223
+ "rewards/format_reward/mean": 0.31749999821186065,
1224
+ "rewards/format_reward/std": 0.19669782146811485,
1225
+ "rewards/security_audit_reward/mean": -0.23999999985098838,
1226
+ "rewards/security_audit_reward/std": 0.2692204549908638,
1227
+ "step": 205,
1228
+ "step_time": 33.11877055760014
1229
+ },
1230
+ {
1231
+ "clip_ratio/high_max": 0.0,
1232
+ "clip_ratio/high_mean": 0.0,
1233
+ "clip_ratio/low_mean": 0.0,
1234
+ "clip_ratio/low_min": 0.0,
1235
+ "clip_ratio/region_mean": 0.0,
1236
+ "completions/clipped_ratio": 0.25,
1237
+ "completions/max_length": 484.0,
1238
+ "completions/max_terminated_length": 403.2,
1239
+ "completions/mean_length": 335.45,
1240
+ "completions/mean_terminated_length": 276.4500030517578,
1241
+ "completions/min_length": 173.8,
1242
+ "completions/min_terminated_length": 173.8,
1243
+ "entropy": 1.1084223449230195,
1244
+ "epoch": 1.2883435582822087,
1245
+ "frac_reward_zero_std": 0.1,
1246
+ "grad_norm": 2.4603023529052734,
1247
+ "learning_rate": 1.7944785276073617e-07,
1248
+ "loss": 0.07945090532302856,
1249
+ "num_tokens": 733245.0,
1250
+ "reward": -0.13774999380111694,
1251
+ "reward_std": 0.2730386942625046,
1252
+ "rewards/format_reward/mean": 0.2174999989569187,
1253
+ "rewards/format_reward/std": 0.22229814901947975,
1254
+ "rewards/security_audit_reward/mean": -0.29000000059604647,
1255
+ "rewards/security_audit_reward/std": 0.3105652093887329,
1256
+ "step": 210,
1257
+ "step_time": 37.03591289120122
1258
+ },
1259
+ {
1260
+ "clip_ratio/high_max": 0.0,
1261
+ "clip_ratio/high_mean": 0.0,
1262
+ "clip_ratio/low_mean": 0.0,
1263
+ "clip_ratio/low_min": 0.0,
1264
+ "clip_ratio/region_mean": 0.0,
1265
+ "completions/clipped_ratio": 0.4,
1266
+ "completions/max_length": 512.0,
1267
+ "completions/max_terminated_length": 361.8,
1268
+ "completions/mean_length": 345.0,
1269
+ "completions/mean_terminated_length": 261.3000030517578,
1270
+ "completions/min_length": 184.8,
1271
+ "completions/min_terminated_length": 184.8,
1272
+ "entropy": 1.2274070978164673,
1273
+ "epoch": 1.3190184049079754,
1274
+ "frac_reward_zero_std": 0.0,
1275
+ "grad_norm": 4.573819637298584,
1276
+ "learning_rate": 1.7177914110429448e-07,
1277
+ "loss": -0.07497722506523133,
1278
+ "num_tokens": 749917.0,
1279
+ "reward": -0.01174999624490738,
1280
+ "reward_std": 0.3002330154180527,
1281
+ "rewards/format_reward/mean": 0.3225000023841858,
1282
+ "rewards/format_reward/std": 0.1751384623348713,
1283
+ "rewards/security_audit_reward/mean": -0.15499999821186067,
1284
+ "rewards/security_audit_reward/std": 0.3648489773273468,
1285
+ "step": 215,
1286
+ "step_time": 38.89012140319937
1287
+ },
1288
+ {
1289
+ "clip_ratio/high_max": 0.0,
1290
+ "clip_ratio/high_mean": 0.0,
1291
+ "clip_ratio/low_mean": 0.0,
1292
+ "clip_ratio/low_min": 0.0,
1293
+ "clip_ratio/region_mean": 0.0,
1294
+ "completions/clipped_ratio": 0.2,
1295
+ "completions/max_length": 481.8,
1296
+ "completions/max_terminated_length": 344.6,
1297
+ "completions/mean_length": 292.3,
1298
+ "completions/mean_terminated_length": 234.26666870117188,
1299
+ "completions/min_length": 122.2,
1300
+ "completions/min_terminated_length": 122.2,
1301
+ "entropy": 1.1711494624614716,
1302
+ "epoch": 1.3496932515337423,
1303
+ "frac_reward_zero_std": 0.1,
1304
+ "grad_norm": 3.879939556121826,
1305
+ "learning_rate": 1.6411042944785276e-07,
1306
+ "loss": 0.06901218891143798,
1307
+ "num_tokens": 765457.0,
1308
+ "reward": -0.2002499908208847,
1309
+ "reward_std": 0.19745510853827,
1310
+ "rewards/format_reward/mean": 0.20749999657273294,
1311
+ "rewards/format_reward/std": 0.2098293460905552,
1312
+ "rewards/security_audit_reward/mean": -0.375,
1313
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1314
+ "step": 220,
1315
+ "step_time": 36.50884771559977
1316
+ },
1317
+ {
1318
+ "clip_ratio/high_max": 0.0,
1319
+ "clip_ratio/high_mean": 0.0,
1320
+ "clip_ratio/low_mean": 0.0,
1321
+ "clip_ratio/low_min": 0.0,
1322
+ "clip_ratio/region_mean": 0.0,
1323
+ "completions/clipped_ratio": 0.35,
1324
+ "completions/max_length": 451.4,
1325
+ "completions/max_terminated_length": 281.8,
1326
+ "completions/mean_length": 310.85,
1327
+ "completions/mean_terminated_length": 210.58333435058594,
1328
+ "completions/min_length": 144.0,
1329
+ "completions/min_terminated_length": 144.0,
1330
+ "entropy": 1.4326449751853942,
1331
+ "epoch": 1.3803680981595092,
1332
+ "frac_reward_zero_std": 0.0,
1333
+ "grad_norm": 5.483170986175537,
1334
+ "learning_rate": 1.5644171779141104e-07,
1335
+ "loss": -0.03490494191646576,
1336
+ "num_tokens": 782226.0,
1337
+ "reward": -0.14649999141693115,
1338
+ "reward_std": 0.19791007936000823,
1339
+ "rewards/format_reward/mean": 0.3049999952316284,
1340
+ "rewards/format_reward/std": 0.19433450996875762,
1341
+ "rewards/security_audit_reward/mean": -0.3399999998509884,
1342
+ "rewards/security_audit_reward/std": 0.22000000029802322,
1343
+ "step": 225,
1344
+ "step_time": 35.079213985799655
1345
+ },
1346
+ {
1347
+ "clip_ratio/high_max": 0.0,
1348
+ "clip_ratio/high_mean": 0.0,
1349
+ "clip_ratio/low_mean": 0.0,
1350
+ "clip_ratio/low_min": 0.0,
1351
+ "clip_ratio/region_mean": 0.0,
1352
+ "completions/clipped_ratio": 0.35,
1353
+ "completions/max_length": 511.2,
1354
+ "completions/max_terminated_length": 315.2,
1355
+ "completions/mean_length": 338.15,
1356
+ "completions/mean_terminated_length": 226.28333435058593,
1357
+ "completions/min_length": 134.8,
1358
+ "completions/min_terminated_length": 134.8,
1359
+ "entropy": 1.1364098012447357,
1360
+ "epoch": 1.4110429447852761,
1361
+ "frac_reward_zero_std": 0.2,
1362
+ "grad_norm": 3.3578364849090576,
1363
+ "learning_rate": 1.4877300613496933e-07,
1364
+ "loss": 0.0896155834197998,
1365
+ "num_tokens": 798571.0,
1366
+ "reward": -0.11574998870491982,
1367
+ "reward_std": 0.19651760943233967,
1368
+ "rewards/format_reward/mean": 0.2674999989569187,
1369
+ "rewards/format_reward/std": 0.15046989992260934,
1370
+ "rewards/security_audit_reward/mean": -0.27999999821186067,
1371
+ "rewards/security_audit_reward/std": 0.2297215759754181,
1372
+ "step": 230,
1373
+ "step_time": 38.61798697480081
1374
+ },
1375
+ {
1376
+ "clip_ratio/high_max": 0.0,
1377
+ "clip_ratio/high_mean": 0.0,
1378
+ "clip_ratio/low_mean": 0.0,
1379
+ "clip_ratio/low_min": 0.0,
1380
+ "clip_ratio/region_mean": 0.0,
1381
+ "completions/clipped_ratio": 0.35,
1382
+ "completions/max_length": 505.4,
1383
+ "completions/max_terminated_length": 385.4,
1384
+ "completions/mean_length": 360.65,
1385
+ "completions/mean_terminated_length": 295.23333740234375,
1386
+ "completions/min_length": 210.2,
1387
+ "completions/min_terminated_length": 210.2,
1388
+ "entropy": 1.0565216183662414,
1389
+ "epoch": 1.441717791411043,
1390
+ "frac_reward_zero_std": 0.0,
1391
+ "grad_norm": 4.266097068786621,
1392
+ "learning_rate": 1.4110429447852758e-07,
1393
+ "loss": 0.07759050726890564,
1394
+ "num_tokens": 815570.0,
1395
+ "reward": -0.06749999299645423,
1396
+ "reward_std": 0.27374918162822726,
1397
+ "rewards/format_reward/mean": 0.37000001072883604,
1398
+ "rewards/format_reward/std": 0.1865294199436903,
1399
+ "rewards/security_audit_reward/mean": -0.2550000011920929,
1400
+ "rewards/security_audit_reward/std": 0.32802181243896483,
1401
+ "step": 235,
1402
+ "step_time": 38.44030983600023
1403
+ },
1404
+ {
1405
+ "clip_ratio/high_max": 0.0,
1406
+ "clip_ratio/high_mean": 0.0,
1407
+ "clip_ratio/low_mean": 0.0,
1408
+ "clip_ratio/low_min": 0.0,
1409
+ "clip_ratio/region_mean": 0.0,
1410
+ "completions/clipped_ratio": 0.3,
1411
+ "completions/max_length": 495.6,
1412
+ "completions/max_terminated_length": 352.6,
1413
+ "completions/mean_length": 336.05,
1414
+ "completions/mean_terminated_length": 243.60000915527343,
1415
+ "completions/min_length": 145.8,
1416
+ "completions/min_terminated_length": 145.8,
1417
+ "entropy": 1.417020809650421,
1418
+ "epoch": 1.4723926380368098,
1419
+ "frac_reward_zero_std": 0.2,
1420
+ "grad_norm": 4.767548084259033,
1421
+ "learning_rate": 1.334355828220859e-07,
1422
+ "loss": 0.03671485185623169,
1423
+ "num_tokens": 831713.0,
1424
+ "reward": -0.12849999219179153,
1425
+ "reward_std": 0.19381159394979477,
1426
+ "rewards/format_reward/mean": 0.3300000041723251,
1427
+ "rewards/format_reward/std": 0.20168980173766612,
1428
+ "rewards/security_audit_reward/mean": -0.325,
1429
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1430
+ "step": 240,
1431
+ "step_time": 37.49174958900003
1432
+ },
1433
+ {
1434
+ "clip_ratio/high_max": 0.0,
1435
+ "clip_ratio/high_mean": 0.0,
1436
+ "clip_ratio/low_mean": 0.0,
1437
+ "clip_ratio/low_min": 0.0,
1438
+ "clip_ratio/region_mean": 0.0,
1439
+ "completions/clipped_ratio": 0.6,
1440
+ "completions/max_length": 512.0,
1441
+ "completions/max_terminated_length": 343.0,
1442
+ "completions/mean_length": 417.3,
1443
+ "completions/mean_terminated_length": 293.7,
1444
+ "completions/min_length": 246.8,
1445
+ "completions/min_terminated_length": 246.8,
1446
+ "entropy": 1.0088598132133484,
1447
+ "epoch": 1.5030674846625767,
1448
+ "frac_reward_zero_std": 0.0,
1449
+ "grad_norm": 3.1936607360839844,
1450
+ "learning_rate": 1.2576687116564417e-07,
1451
+ "loss": -0.03240810632705689,
1452
+ "num_tokens": 849855.0,
1453
+ "reward": -0.11324999332427979,
1454
+ "reward_std": 0.2602782666683197,
1455
+ "rewards/format_reward/mean": 0.3224999994039536,
1456
+ "rewards/format_reward/std": 0.20757876634597777,
1457
+ "rewards/security_audit_reward/mean": -0.3,
1458
+ "rewards/security_audit_reward/std": 0.3154700517654419,
1459
+ "step": 245,
1460
+ "step_time": 39.07302968719996
1461
+ },
1462
+ {
1463
+ "clip_ratio/high_max": 0.0,
1464
+ "clip_ratio/high_mean": 0.0,
1465
+ "clip_ratio/low_mean": 0.0,
1466
+ "clip_ratio/low_min": 0.0,
1467
+ "clip_ratio/region_mean": 0.0,
1468
+ "completions/clipped_ratio": 0.2,
1469
+ "completions/max_length": 459.6,
1470
+ "completions/max_terminated_length": 428.2,
1471
+ "completions/mean_length": 325.35,
1472
+ "completions/mean_terminated_length": 277.5833374023438,
1473
+ "completions/min_length": 137.0,
1474
+ "completions/min_terminated_length": 137.0,
1475
+ "entropy": 1.1867628961801528,
1476
+ "epoch": 1.5337423312883436,
1477
+ "frac_reward_zero_std": 0.1,
1478
+ "grad_norm": 3.9449942111968994,
1479
+ "learning_rate": 1.1809815950920244e-07,
1480
+ "loss": -0.005416367202997208,
1481
+ "num_tokens": 866330.0,
1482
+ "reward": -0.08199999034404755,
1483
+ "reward_std": 0.2747137784957886,
1484
+ "rewards/format_reward/mean": 0.3099999874830246,
1485
+ "rewards/format_reward/std": 0.17935641929507257,
1486
+ "rewards/security_audit_reward/mean": -0.25,
1487
+ "rewards/security_audit_reward/std": 0.33094010353088377,
1488
+ "step": 250,
1489
+ "step_time": 35.27377968320034
1490
+ },
1491
+ {
1492
+ "clip_ratio/high_max": 0.0,
1493
+ "clip_ratio/high_mean": 0.0,
1494
+ "clip_ratio/low_mean": 0.0,
1495
+ "clip_ratio/low_min": 0.0,
1496
+ "clip_ratio/region_mean": 0.0,
1497
+ "completions/clipped_ratio": 0.25,
1498
+ "completions/max_length": 450.0,
1499
+ "completions/max_terminated_length": 311.2,
1500
+ "completions/mean_length": 283.8,
1501
+ "completions/mean_terminated_length": 202.06666870117186,
1502
+ "completions/min_length": 128.0,
1503
+ "completions/min_terminated_length": 128.0,
1504
+ "entropy": 1.322118791937828,
1505
+ "epoch": 1.5644171779141103,
1506
+ "frac_reward_zero_std": 0.1,
1507
+ "grad_norm": 6.9948039054870605,
1508
+ "learning_rate": 1.1042944785276073e-07,
1509
+ "loss": 0.055895209312438965,
1510
+ "num_tokens": 881650.0,
1511
+ "reward": -0.11924999132752419,
1512
+ "reward_std": 0.14510822538286447,
1513
+ "rewards/format_reward/mean": 0.3024999976158142,
1514
+ "rewards/format_reward/std": 0.15741010159254074,
1515
+ "rewards/security_audit_reward/mean": -0.3,
1516
+ "rewards/security_audit_reward/std": 0.15773502588272095,
1517
+ "step": 255,
1518
+ "step_time": 34.332786842800125
1519
+ },
1520
+ {
1521
+ "clip_ratio/high_max": 0.0,
1522
+ "clip_ratio/high_mean": 0.0,
1523
+ "clip_ratio/low_mean": 0.0,
1524
+ "clip_ratio/low_min": 0.0,
1525
+ "clip_ratio/region_mean": 0.0,
1526
+ "completions/clipped_ratio": 0.3,
1527
+ "completions/max_length": 485.0,
1528
+ "completions/max_terminated_length": 312.2,
1529
+ "completions/mean_length": 312.0,
1530
+ "completions/mean_terminated_length": 225.13333740234376,
1531
+ "completions/min_length": 141.8,
1532
+ "completions/min_terminated_length": 141.8,
1533
+ "entropy": 1.1261488378047944,
1534
+ "epoch": 1.5950920245398774,
1535
+ "frac_reward_zero_std": 0.1,
1536
+ "grad_norm": 5.633542537689209,
1537
+ "learning_rate": 1.0276073619631902e-07,
1538
+ "loss": 0.12422184944152832,
1539
+ "num_tokens": 898170.0,
1540
+ "reward": -0.09424999356269836,
1541
+ "reward_std": 0.2452640563249588,
1542
+ "rewards/format_reward/mean": 0.3274999916553497,
1543
+ "rewards/format_reward/std": 0.19424656331539153,
1544
+ "rewards/security_audit_reward/mean": -0.275,
1545
+ "rewards/security_audit_reward/std": 0.29574271440505984,
1546
+ "step": 260,
1547
+ "step_time": 37.18088837539908
1548
+ },
1549
+ {
1550
+ "clip_ratio/high_max": 0.0,
1551
+ "clip_ratio/high_mean": 0.0,
1552
+ "clip_ratio/low_mean": 0.0,
1553
+ "clip_ratio/low_min": 0.0,
1554
+ "clip_ratio/region_mean": 0.0,
1555
+ "completions/clipped_ratio": 0.3,
1556
+ "completions/max_length": 481.4,
1557
+ "completions/max_terminated_length": 333.8,
1558
+ "completions/mean_length": 323.4,
1559
+ "completions/mean_terminated_length": 233.95,
1560
+ "completions/min_length": 174.6,
1561
+ "completions/min_terminated_length": 174.6,
1562
+ "entropy": 1.3164357602596284,
1563
+ "epoch": 1.6257668711656441,
1564
+ "frac_reward_zero_std": 0.2,
1565
+ "grad_norm": 4.177126884460449,
1566
+ "learning_rate": 9.50920245398773e-08,
1567
+ "loss": -0.0031075358390808107,
1568
+ "num_tokens": 914438.0,
1569
+ "reward": -0.109499990940094,
1570
+ "reward_std": 0.19355954378843307,
1571
+ "rewards/format_reward/mean": 0.3700000077486038,
1572
+ "rewards/format_reward/std": 0.19431518614292145,
1573
+ "rewards/security_audit_reward/mean": -0.31500000059604644,
1574
+ "rewards/security_audit_reward/std": 0.21745660305023193,
1575
+ "step": 265,
1576
+ "step_time": 36.75480746599969
1577
+ },
1578
+ {
1579
+ "clip_ratio/high_max": 0.0,
1580
+ "clip_ratio/high_mean": 0.0,
1581
+ "clip_ratio/low_mean": 0.0,
1582
+ "clip_ratio/low_min": 0.0,
1583
+ "clip_ratio/region_mean": 0.0,
1584
+ "completions/clipped_ratio": 0.15,
1585
+ "completions/max_length": 460.2,
1586
+ "completions/max_terminated_length": 279.2,
1587
+ "completions/mean_length": 224.9,
1588
+ "completions/mean_terminated_length": 170.45000305175782,
1589
+ "completions/min_length": 84.4,
1590
+ "completions/min_terminated_length": 84.4,
1591
+ "entropy": 1.267154586315155,
1592
+ "epoch": 1.656441717791411,
1593
+ "frac_reward_zero_std": 0.2,
1594
+ "grad_norm": 7.552680015563965,
1595
+ "learning_rate": 8.742331288343557e-08,
1596
+ "loss": -0.12154214382171631,
1597
+ "num_tokens": 928536.0,
1598
+ "reward": -0.07999998778104782,
1599
+ "reward_std": 0.16925212144851684,
1600
+ "rewards/format_reward/mean": 0.37499999403953554,
1601
+ "rewards/format_reward/std": 0.1521439790725708,
1602
+ "rewards/security_audit_reward/mean": -0.275,
1603
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1604
+ "step": 270,
1605
+ "step_time": 35.04094967719975
1606
+ },
1607
+ {
1608
+ "clip_ratio/high_max": 0.0,
1609
+ "clip_ratio/high_mean": 0.0,
1610
+ "clip_ratio/low_mean": 0.0,
1611
+ "clip_ratio/low_min": 0.0,
1612
+ "clip_ratio/region_mean": 0.0,
1613
+ "completions/clipped_ratio": 0.3,
1614
+ "completions/max_length": 469.4,
1615
+ "completions/max_terminated_length": 287.6,
1616
+ "completions/mean_length": 300.95,
1617
+ "completions/mean_terminated_length": 204.76667175292968,
1618
+ "completions/min_length": 130.0,
1619
+ "completions/min_terminated_length": 130.0,
1620
+ "entropy": 1.1936017721891403,
1621
+ "epoch": 1.687116564417178,
1622
+ "frac_reward_zero_std": 0.2,
1623
+ "grad_norm": 7.106525421142578,
1624
+ "learning_rate": 7.975460122699386e-08,
1625
+ "loss": -0.0458857923746109,
1626
+ "num_tokens": 944307.0,
1627
+ "reward": -0.05174999088048935,
1628
+ "reward_std": 0.23333178758621215,
1629
+ "rewards/format_reward/mean": 0.3874999940395355,
1630
+ "rewards/format_reward/std": 0.16559004038572311,
1631
+ "rewards/security_audit_reward/mean": -0.24000000059604645,
1632
+ "rewards/security_audit_reward/std": 0.2866505742073059,
1633
+ "step": 275,
1634
+ "step_time": 36.0470411268001
1635
+ },
1636
+ {
1637
+ "clip_ratio/high_max": 0.0,
1638
+ "clip_ratio/high_mean": 0.0,
1639
+ "clip_ratio/low_mean": 0.0,
1640
+ "clip_ratio/low_min": 0.0,
1641
+ "clip_ratio/region_mean": 0.0,
1642
+ "completions/clipped_ratio": 0.2,
1643
+ "completions/max_length": 498.6,
1644
+ "completions/max_terminated_length": 375.0,
1645
+ "completions/mean_length": 295.25,
1646
+ "completions/mean_terminated_length": 232.6666717529297,
1647
+ "completions/min_length": 97.8,
1648
+ "completions/min_terminated_length": 97.8,
1649
+ "entropy": 1.460896384716034,
1650
+ "epoch": 1.7177914110429446,
1651
+ "frac_reward_zero_std": 0.0,
1652
+ "grad_norm": 5.004129409790039,
1653
+ "learning_rate": 7.208588957055214e-08,
1654
+ "loss": -0.10658804178237916,
1655
+ "num_tokens": 960078.0,
1656
+ "reward": -0.01824999153614044,
1657
+ "reward_std": 0.2521414369344711,
1658
+ "rewards/format_reward/mean": 0.3825000047683716,
1659
+ "rewards/format_reward/std": 0.15327396541833876,
1660
+ "rewards/security_audit_reward/mean": -0.1899999976158142,
1661
+ "rewards/security_audit_reward/std": 0.3234777390956879,
1662
+ "step": 280,
1663
+ "step_time": 37.26412723539943
1664
+ },
1665
+ {
1666
+ "clip_ratio/high_max": 0.0,
1667
+ "clip_ratio/high_mean": 0.0,
1668
+ "clip_ratio/low_mean": 0.0,
1669
+ "clip_ratio/low_min": 0.0,
1670
+ "clip_ratio/region_mean": 0.0,
1671
+ "completions/clipped_ratio": 0.3,
1672
+ "completions/max_length": 501.0,
1673
+ "completions/max_terminated_length": 363.2,
1674
+ "completions/mean_length": 340.6,
1675
+ "completions/mean_terminated_length": 257.06666870117186,
1676
+ "completions/min_length": 146.2,
1677
+ "completions/min_terminated_length": 146.2,
1678
+ "entropy": 1.0431257128715514,
1679
+ "epoch": 1.7484662576687118,
1680
+ "frac_reward_zero_std": 0.1,
1681
+ "grad_norm": 4.059199333190918,
1682
+ "learning_rate": 6.441717791411043e-08,
1683
+ "loss": -0.10386581420898437,
1684
+ "num_tokens": 976992.0,
1685
+ "reward": -0.06924999207258224,
1686
+ "reward_std": 0.2484972782433033,
1687
+ "rewards/format_reward/mean": 0.38750000596046447,
1688
+ "rewards/format_reward/std": 0.1250488668680191,
1689
+ "rewards/security_audit_reward/mean": -0.26500000059604645,
1690
+ "rewards/security_audit_reward/std": 0.3080150008201599,
1691
+ "step": 285,
1692
+ "step_time": 38.17630281240017
1693
+ },
1694
+ {
1695
+ "clip_ratio/high_max": 0.0,
1696
+ "clip_ratio/high_mean": 0.0,
1697
+ "clip_ratio/low_mean": 0.0,
1698
+ "clip_ratio/low_min": 0.0,
1699
+ "clip_ratio/region_mean": 0.0,
1700
+ "completions/clipped_ratio": 0.05,
1701
+ "completions/max_length": 435.0,
1702
+ "completions/max_terminated_length": 413.6,
1703
+ "completions/mean_length": 281.65,
1704
+ "completions/mean_terminated_length": 274.0,
1705
+ "completions/min_length": 152.0,
1706
+ "completions/min_terminated_length": 152.0,
1707
+ "entropy": 1.1775987446308136,
1708
+ "epoch": 1.7791411042944785,
1709
+ "frac_reward_zero_std": 0.3,
1710
+ "grad_norm": 4.12350606918335,
1711
+ "learning_rate": 5.674846625766871e-08,
1712
+ "loss": -0.03856886327266693,
1713
+ "num_tokens": 992819.0,
1714
+ "reward": -0.05999999046325684,
1715
+ "reward_std": 0.1456713281571865,
1716
+ "rewards/format_reward/mean": 0.360000005364418,
1717
+ "rewards/format_reward/std": 0.12044776938855647,
1718
+ "rewards/security_audit_reward/mean": -0.23999999985098838,
1719
+ "rewards/security_audit_reward/std": 0.17773502618074416,
1720
+ "step": 290,
1721
+ "step_time": 33.279480656001034
1722
+ },
1723
+ {
1724
+ "clip_ratio/high_max": 0.0,
1725
+ "clip_ratio/high_mean": 0.0,
1726
+ "clip_ratio/low_mean": 0.0,
1727
+ "clip_ratio/low_min": 0.0,
1728
+ "clip_ratio/region_mean": 0.0,
1729
+ "completions/clipped_ratio": 0.15,
1730
+ "completions/max_length": 470.0,
1731
+ "completions/max_terminated_length": 383.0,
1732
+ "completions/mean_length": 306.15,
1733
+ "completions/mean_terminated_length": 264.66666717529296,
1734
+ "completions/min_length": 170.8,
1735
+ "completions/min_terminated_length": 170.8,
1736
+ "entropy": 1.3437508165836334,
1737
+ "epoch": 1.8098159509202454,
1738
+ "frac_reward_zero_std": 0.0,
1739
+ "grad_norm": 5.014013290405273,
1740
+ "learning_rate": 4.907975460122699e-08,
1741
+ "loss": 0.1800641179084778,
1742
+ "num_tokens": 1008676.0,
1743
+ "reward": -0.14524998962879182,
1744
+ "reward_std": 0.19439554661512376,
1745
+ "rewards/format_reward/mean": 0.33249999284744264,
1746
+ "rewards/format_reward/std": 0.21220951080322265,
1747
+ "rewards/security_audit_reward/mean": -0.35,
1748
+ "rewards/security_audit_reward/std": 0.2154700517654419,
1749
+ "step": 295,
1750
+ "step_time": 35.92737517459973
1751
+ },
1752
+ {
1753
+ "clip_ratio/high_max": 0.0,
1754
+ "clip_ratio/high_mean": 0.0,
1755
+ "clip_ratio/low_mean": 0.0,
1756
+ "clip_ratio/low_min": 0.0,
1757
+ "clip_ratio/region_mean": 0.0,
1758
+ "completions/clipped_ratio": 0.3,
1759
+ "completions/max_length": 471.6,
1760
+ "completions/max_terminated_length": 299.6,
1761
+ "completions/mean_length": 360.45,
1762
+ "completions/mean_terminated_length": 234.9166687011719,
1763
+ "completions/min_length": 229.0,
1764
+ "completions/min_terminated_length": 126.6,
1765
+ "entropy": 1.1840724140405654,
1766
+ "epoch": 1.8404907975460123,
1767
+ "frac_reward_zero_std": 0.1,
1768
+ "grad_norm": 2.570652484893799,
1769
+ "learning_rate": 4.1411042944785274e-08,
1770
+ "loss": 0.019638296961784363,
1771
+ "num_tokens": 1025285.0,
1772
+ "reward": -0.06699999049305916,
1773
+ "reward_std": 0.2400740846991539,
1774
+ "rewards/format_reward/mean": 0.3600000023841858,
1775
+ "rewards/format_reward/std": 0.1531308189034462,
1776
+ "rewards/security_audit_reward/mean": -0.25,
1777
+ "rewards/security_audit_reward/std": 0.2868344783782959,
1778
+ "step": 300,
1779
+ "step_time": 35.41554451999982
1780
+ }
1781
+ ],
1782
+ "logging_steps": 5,
1783
+ "max_steps": 326,
1784
+ "num_input_tokens_seen": 1025285,
1785
+ "num_train_epochs": 2,
1786
+ "save_steps": 50,
1787
+ "stateful_callbacks": {
1788
+ "TrainerControl": {
1789
+ "args": {
1790
+ "should_epoch_stop": false,
1791
+ "should_evaluate": false,
1792
+ "should_log": false,
1793
+ "should_save": true,
1794
+ "should_training_stop": false
1795
+ },
1796
+ "attributes": {}
1797
+ }
1798
+ },
1799
+ "total_flos": 0.0,
1800
+ "train_batch_size": 2,
1801
+ "trial_name": null,
1802
+ "trial_params": null
1803
+ }
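The logged values in this `trainer_state.json` can be cross-checked against each other. The sketch below is a minimal consistency check, not part of the training code. It assumes a linear decay to zero over `max_steps` from a base learning rate of `5e-7`, and reward weights of `0.3` (format) and `0.7` (security audit). None of these three constants appear in the diff; they are inferred from the logged numbers. All helper names are hypothetical.

```python
# Consistency check for values logged in checkpoint-300/trainer_state.json.
MAX_STEPS = 326        # "max_steps" from the file
BASE_LR = 5e-7         # assumption: inferred from the logged learning rates
FORMAT_WEIGHT = 0.3    # assumption: inferred from the logged reward means
AUDIT_WEIGHT = 0.7     # assumption: inferred from the logged reward means


def expected_lr(step: int) -> float:
    """Linear decay to zero; the rate logged at `step` follows step - 1 scheduler updates."""
    return BASE_LR * (MAX_STEPS - (step - 1)) / MAX_STEPS


def combined_reward(format_mean: float, audit_mean: float) -> float:
    """Weighted sum of the two per-function reward means."""
    return FORMAT_WEIGHT * format_mean + AUDIT_WEIGHT * audit_mean


# (step, learning_rate, format_reward/mean, security_audit_reward/mean, reward)
LOGGED = [
    (190, 2.1012269938650307e-07, 0.31000000387430193, -0.20500000119209288, -0.050499990582466125),
    (300, 4.1411042944785274e-08, 0.3600000023841858, -0.25, -0.06699999049305916),
]

for step, lr, fmt, audit, reward in LOGGED:
    assert abs(expected_lr(step) - lr) < 1e-12
    assert abs(combined_reward(fmt, audit) - reward) < 1e-6
```

If either assertion fails on other log entries, the inferred constants do not hold for that part of the run (for example, during a warmup phase).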
checkpoint-300/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b51f3856815802830b7add9e23ddd089207e5c9941078dd606f120af0f983d09
3
+ size 6776
checkpoint-326/chat_template.jinja ADDED
@@ -0,0 +1,54 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0]['role'] == 'system' %}
4
+ {{- messages[0]['content'] }}
5
+ {%- else %}
6
+ {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
7
+ {%- endif %}
8
+ {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
9
+ {%- for tool in tools %}
10
+ {{- "\n" }}
11
+ {{- tool | tojson }}
12
+ {%- endfor %}
13
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
14
+ {%- else %}
15
+ {%- if messages[0]['role'] == 'system' %}
16
+ {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
17
+ {%- else %}
18
+ {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
19
+ {%- endif %}
20
+ {%- endif %}
21
+ {%- for message in messages %}
22
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
23
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
24
+ {%- elif message.role == "assistant" %}
25
+ {{- '<|im_start|>' + message.role }}
26
+ {%- if message.content %}
27
+ {{- '\n' + message.content }}
28
+ {%- endif %}
29
+ {%- for tool_call in message.tool_calls %}
30
+ {%- if tool_call.function is defined %}
31
+ {%- set tool_call = tool_call.function %}
32
+ {%- endif %}
33
+ {{- '\n<tool_call>\n{"name": "' }}
34
+ {{- tool_call.name }}
35
+ {{- '", "arguments": ' }}
36
+ {{- tool_call.arguments | tojson }}
37
+ {{- '}\n</tool_call>' }}
38
+ {%- endfor %}
39
+ {{- '<|im_end|>\n' }}
40
+ {%- elif message.role == "tool" %}
41
+ {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
42
+ {{- '<|im_start|>user' }}
43
+ {%- endif %}
44
+ {{- '\n<tool_response>\n' }}
45
+ {{- message.content }}
46
+ {{- '\n</tool_response>' }}
47
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
48
+ {{- '<|im_end|>\n' }}
49
+ {%- endif %}
50
+ {%- endif %}
51
+ {%- endfor %}
52
+ {%- if add_generation_prompt %}
53
+ {{- '<|im_start|>assistant\n' }}
54
+ {%- endif %}
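To illustrate what the template above produces when no tools are passed, here is a minimal pure-Python sketch of that path for plain system, user, and assistant turns. It is a simplified stand-in written for illustration, not the Jinja renderer itself. The function name `render_chatml` is hypothetical, and tool calls and tool responses are deliberately left out.

```python
# Default system prompt, copied from the template above.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Mimic the template's no-tools branch for messages without tool calls."""
    parts = []
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        parts.append(f"<|im_start|>system\n{messages[0]['content']}<|im_end|>\n")
        rest = messages[1:]
    else:
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
        rest = messages
    for message in rest:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Audit this contract."}])
```

In practice the tokenizer applies the Jinja template directly. The sketch only makes the resulting prompt layout explicit.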
checkpoint-326/config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2ForCausalLM"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": null,
7
+ "dtype": "float32",
8
+ "eos_token_id": 151645,
9
+ "hidden_act": "silu",
10
+ "hidden_size": 896,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 4864,
13
+ "layer_types": [
14
+ "full_attention",
15
+ "full_attention",
16
+ "full_attention",
17
+ "full_attention",
18
+ "full_attention",
19
+ "full_attention",
20
+ "full_attention",
21
+ "full_attention",
22
+ "full_attention",
23
+ "full_attention",
24
+ "full_attention",
25
+ "full_attention",
26
+ "full_attention",
27
+ "full_attention",
28
+ "full_attention",
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention"
38
+ ],
39
+ "max_position_embeddings": 32768,
40
+ "max_window_layers": 24,
41
+ "model_type": "qwen2",
42
+ "num_attention_heads": 14,
43
+ "num_hidden_layers": 24,
44
+ "num_key_value_heads": 2,
45
+ "pad_token_id": 151643,
46
+ "rms_norm_eps": 1e-06,
47
+ "rope_parameters": {
48
+ "rope_theta": 1000000.0,
49
+ "rope_type": "default"
50
+ },
51
+ "sliding_window": null,
52
+ "tie_word_embeddings": true,
53
+ "transformers_version": "5.6.2",
54
+ "use_cache": false,
55
+ "use_sliding_window": false,
56
+ "vocab_size": 151936
57
+ }
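The dimensions in this `config.json` are enough to estimate the parameter count and compare it with the size of `model.safetensors`. The sketch below is a rough estimate under stated assumptions: the usual Qwen2 layer layout (biases on the query, key, and value projections only, a gated MLP with three weight matrices, two RMSNorm weights per layer plus a final one) and tied input and output embeddings, as `tie_word_embeddings` indicates. The helper name `qwen2_param_count` is hypothetical.

```python
# Values copied from checkpoint-326/config.json.
CONFIG = {
    "hidden_size": 896,
    "intermediate_size": 4864,
    "num_attention_heads": 14,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
    "vocab_size": 151936,
}


def qwen2_param_count(cfg):
    """Estimate parameters, assuming the standard Qwen2 layout with tied embeddings."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim  # grouped-query attention

    attention = (
        hidden * hidden + hidden        # q_proj weight + bias
        + 2 * (hidden * kv_dim + kv_dim)  # k_proj and v_proj weights + biases
        + hidden * hidden               # o_proj weight, no bias
    )
    mlp = 3 * hidden * cfg["intermediate_size"]  # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * hidden      # shared with the output head
    return cfg["num_hidden_layers"] * per_layer + embeddings + hidden  # + final norm


params = qwen2_param_count(CONFIG)
fp32_bytes = params * 4  # "dtype": "float32"
```

Under these assumptions the estimate comes to roughly 494M parameters, and the float32 byte count lands within a few tens of kilobytes of the 1976163472-byte `model.safetensors` listed in this commit. The remainder is plausibly file header metadata.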
checkpoint-326/generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "do_sample": true,
3
+ "eos_token_id": [
4
+ 151645,
5
+ 151643
6
+ ],
7
+ "pad_token_id": 151643,
8
+ "repetition_penalty": 1.05,
9
+ "temperature": 0.7,
10
+ "top_k": 20,
11
+ "top_p": 0.8,
12
+ "transformers_version": "5.6.2"
13
+ }
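The sampling settings above (`temperature`, `top_k`, `top_p`) narrow the next-token distribution before a token is drawn. The sketch below illustrates that filtering on a toy logit vector, using only the standard library. It is a simplified illustration, not the library's implementation: it ignores `repetition_penalty`, and the function name `filter_probs` is hypothetical.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring tokens, in descending order.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[order[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


filtered = filter_probs([2.0, 1.0, 0.0, -1.0])
```

With these toy logits, the top-scoring token alone carries less than 0.8 of the probability mass, so the second token is kept as well and the remaining two are dropped.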
checkpoint-326/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c64421ad1b2b08b8f84687a636657a68a9f6c9ef639c6c2dc449cb93d2c4219
3
+ size 1976163472
checkpoint-326/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2029a2d6a0a790f550621622c3c990b0d5f81492de3d7642988e1fa042d3a073
3
+ size 3952505274
checkpoint-326/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7709cb910b037a9984c235d2fc7fe7fd99ccb8982993a4ca396269149709e777
3
+ size 14244
checkpoint-326/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3aa6a7bb4866149bd0f10ff54d9da7e2c37c93aa3f37a2a6471c11bd6760f19
3
+ size 1064
checkpoint-326/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
3
+ size 11421892
checkpoint-326/tokenizer_config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "backend": "tokenizers",
4
+ "bos_token": null,
5
+ "clean_up_tokenization_spaces": false,
6
+ "eos_token": "<|im_end|>",
7
+ "errors": "replace",
8
+ "extra_special_tokens": [
9
+ "<|im_start|>",
10
+ "<|im_end|>",
11
+ "<|object_ref_start|>",
12
+ "<|object_ref_end|>",
13
+ "<|box_start|>",
14
+ "<|box_end|>",
15
+ "<|quad_start|>",
16
+ "<|quad_end|>",
17
+ "<|vision_start|>",
18
+ "<|vision_end|>",
19
+ "<|vision_pad|>",
20
+ "<|image_pad|>",
21
+ "<|video_pad|>"
22
+ ],
23
+ "is_local": false,
24
+ "local_files_only": false,
25
+ "model_max_length": 32768,
26
+ "pad_token": "<|endoftext|>",
27
+ "padding_side": "left",
28
+ "split_special_tokens": false,
29
+ "tokenizer_class": "Qwen2Tokenizer",
30
+ "truncation_side": "left",
31
+ "unk_token": null
32
+ }
checkpoint-326/trainer_state.json ADDED
@@ -0,0 +1,1948 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 2.0,
6
+ "eval_steps": 500,
7
+ "global_step": 326,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "clip_ratio/high_max": 0.0,
14
+ "clip_ratio/high_mean": 0.0,
15
+ "clip_ratio/low_mean": 0.0,
16
+ "clip_ratio/low_min": 0.0,
17
+ "clip_ratio/region_mean": 0.0,
18
+ "completions/clipped_ratio": 0.75,
19
+ "completions/max_length": 512.0,
20
+ "completions/max_terminated_length": 37.0,
21
+ "completions/mean_length": 393.25,
22
+ "completions/mean_terminated_length": 37.0,
23
+ "completions/min_length": 37.0,
24
+ "completions/min_terminated_length": 37.0,
25
+ "entropy": 1.4897738695144653,
26
+ "epoch": 0.006134969325153374,
27
+ "frac_reward_zero_std": 0.5,
28
+ "grad_norm": 2.2988293170928955,
29
+ "learning_rate": 5e-07,
30
+ "loss": -0.21252349019050598,
31
+ "num_tokens": 3567.0,
32
+ "reward": -0.3424999713897705,
33
+ "reward_std": 0.01500000525265932,
34
+ "rewards/format_reward/mean": 0.02500000037252903,
35
+ "rewards/format_reward/std": 0.05000000074505806,
36
+ "rewards/security_audit_reward/mean": -0.5,
37
+ "rewards/security_audit_reward/std": 0.0,
38
+ "step": 1,
39
+ "step_time": 39.508622552999896
40
+ },
41
+ {
42
+ "clip_ratio/high_max": 0.0,
43
+ "clip_ratio/high_mean": 0.0,
44
+ "clip_ratio/low_mean": 0.0,
45
+ "clip_ratio/low_min": 0.0,
46
+ "clip_ratio/region_mean": 0.0,
47
+ "completions/clipped_ratio": 0.5,
48
+ "completions/max_length": 512.0,
49
+ "completions/max_terminated_length": 248.75,
50
+ "completions/mean_length": 389.625,
51
+ "completions/mean_terminated_length": 192.79166793823242,
52
+ "completions/min_length": 272.75,
53
+ "completions/min_terminated_length": 144.75,
54
+ "entropy": 1.363443061709404,
55
+ "epoch": 0.03067484662576687,
56
+ "frac_reward_zero_std": 0.375,
57
+ "grad_norm": 4.688082218170166,
58
+ "learning_rate": 4.938650306748465e-07,
59
+ "loss": 0.04808004945516586,
60
+ "num_tokens": 17675.0,
61
+ "reward": -0.2981249839067459,
62
+ "reward_std": 0.08178356755524874,
63
+ "rewards/format_reward/mean": 0.10000000381842256,
64
+ "rewards/format_reward/std": 0.12774468399584293,
65
+ "rewards/security_audit_reward/mean": -0.46875,
66
+ "rewards/security_audit_reward/std": 0.0625,
67
+ "step": 5,
68
+ "step_time": 38.500043476749966
69
+ },
70
+ {
71
+ "clip_ratio/high_max": 0.0,
72
+ "clip_ratio/high_mean": 0.0,
73
+ "clip_ratio/low_mean": 0.0,
74
+ "clip_ratio/low_min": 0.0,
75
+ "clip_ratio/region_mean": 0.0,
76
+ "completions/clipped_ratio": 0.65,
77
+ "completions/max_length": 512.0,
78
+ "completions/max_terminated_length": 345.6,
79
+ "completions/mean_length": 463.0,
80
+ "completions/mean_terminated_length": 305.6,
81
+ "completions/min_length": 363.8,
82
+ "completions/min_terminated_length": 261.4,
83
+ "entropy": 1.4113845229148865,
84
+ "epoch": 0.06134969325153374,
85
+ "frac_reward_zero_std": 0.4,
86
+ "grad_norm": 3.245452880859375,
87
+ "learning_rate": 4.86196319018405e-07,
88
+ "loss": -0.00041331946849823,
89
+ "num_tokens": 37093.0,
90
+ "reward": -0.29424998760223386,
91
+ "reward_std": 0.08391451295465231,
92
+ "rewards/format_reward/mean": 0.12750000804662703,
93
+ "rewards/format_reward/std": 0.16304838731884957,
94
+ "rewards/security_audit_reward/mean": -0.475,
95
+ "rewards/security_audit_reward/std": 0.05,
96
+ "step": 10,
97
+ "step_time": 39.192330704800135
98
+ },
99
+ {
100
+ "clip_ratio/high_max": 0.0,
101
+ "clip_ratio/high_mean": 0.0,
102
+ "clip_ratio/low_mean": 0.0,
103
+ "clip_ratio/low_min": 0.0,
104
+ "clip_ratio/region_mean": 0.0,
105
+ "completions/clipped_ratio": 0.6,
106
+ "completions/max_length": 512.0,
107
+ "completions/max_terminated_length": 394.0,
108
+ "completions/mean_length": 455.5,
109
+ "completions/mean_terminated_length": 352.9,
110
+ "completions/min_length": 311.8,
111
+ "completions/min_terminated_length": 311.8,
112
+ "entropy": 1.179759132862091,
113
+ "epoch": 0.09202453987730061,
114
+ "frac_reward_zero_std": 0.7,
115
+ "grad_norm": 2.9624693393707275,
116
+ "learning_rate": 4.785276073619632e-07,
117
+ "loss": 0.03452911972999573,
118
+ "num_tokens": 55311.0,
119
+ "reward": -0.2887499898672104,
120
+ "reward_std": 0.09658594038337469,
121
+ "rewards/format_reward/mean": 0.0875,
122
+ "rewards/format_reward/std": 0.08947573080658913,
123
+ "rewards/security_audit_reward/mean": -0.45,
124
+ "rewards/security_audit_reward/std": 0.1,
125
+ "step": 15,
126
+ "step_time": 38.30515608799997
127
+ },
128
+ {
129
+ "clip_ratio/high_max": 0.0,
130
+ "clip_ratio/high_mean": 0.0,
131
+ "clip_ratio/low_mean": 0.0,
132
+ "clip_ratio/low_min": 0.0,
133
+ "clip_ratio/region_mean": 0.0,
134
+ "completions/clipped_ratio": 0.45,
135
+ "completions/max_length": 512.0,
136
+ "completions/max_terminated_length": 395.2,
137
+ "completions/mean_length": 416.9,
138
+ "completions/mean_terminated_length": 328.76666870117185,
139
+ "completions/min_length": 260.8,
140
+ "completions/min_terminated_length": 260.8,
141
+ "entropy": 1.298638153076172,
142
+ "epoch": 0.12269938650306748,
143
+ "frac_reward_zero_std": 0.2,
144
+ "grad_norm": 4.034470081329346,
145
+ "learning_rate": 4.7085889570552147e-07,
146
+ "loss": -0.008246073126792907,
147
+ "num_tokens": 72771.0,
148
+ "reward": -0.23124998807907104,
149
+ "reward_std": 0.16768747363239528,
150
+ "rewards/format_reward/mean": 0.19750000424683095,
151
+ "rewards/format_reward/std": 0.2057904489338398,
152
+ "rewards/security_audit_reward/mean": -0.4149999976158142,
153
+ "rewards/security_audit_reward/std": 0.16999999880790712,
154
+ "step": 20,
155
+ "step_time": 37.87772348239996
156
+ },
157
+ {
158
+ "clip_ratio/high_max": 0.0,
159
+ "clip_ratio/high_mean": 0.0,
160
+ "clip_ratio/low_mean": 0.0,
161
+ "clip_ratio/low_min": 0.0,
162
+ "clip_ratio/region_mean": 0.0,
163
+ "completions/clipped_ratio": 0.3,
164
+ "completions/max_length": 512.0,
165
+ "completions/max_terminated_length": 423.4,
166
+ "completions/mean_length": 382.1,
167
+ "completions/mean_terminated_length": 334.1000091552734,
168
+ "completions/min_length": 236.0,
169
+ "completions/min_terminated_length": 236.0,
170
+ "entropy": 1.317835807800293,
171
+ "epoch": 0.15337423312883436,
172
+ "frac_reward_zero_std": 0.3,
173
+ "grad_norm": 2.853423595428467,
174
+ "learning_rate": 4.631901840490797e-07,
175
+ "loss": -0.013739901781082153,
176
+ "num_tokens": 89889.0,
177
+ "reward": -0.2974999874830246,
178
+ "reward_std": 0.15671177953481674,
179
+ "rewards/format_reward/mean": 0.17500000596046447,
180
+ "rewards/format_reward/std": 0.18444484770298003,
181
+ "rewards/security_audit_reward/mean": -0.5,
182
+ "rewards/security_audit_reward/std": 0.15773502588272095,
183
+ "step": 25,
184
+ "step_time": 38.74009619139997
185
+ },
186
+ {
187
+ "clip_ratio/high_max": 0.0,
188
+ "clip_ratio/high_mean": 0.0,
189
+ "clip_ratio/low_mean": 0.0,
190
+ "clip_ratio/low_min": 0.0,
191
+ "clip_ratio/region_mean": 0.0,
192
+ "completions/clipped_ratio": 0.65,
193
+ "completions/max_length": 512.0,
194
+ "completions/max_terminated_length": 346.0,
195
+ "completions/mean_length": 463.35,
196
+ "completions/mean_terminated_length": 295.3,
197
+ "completions/min_length": 337.8,
198
+ "completions/min_terminated_length": 235.4,
199
+ "entropy": 1.1444598376750945,
200
+ "epoch": 0.18404907975460122,
201
+ "frac_reward_zero_std": 0.2,
202
+ "grad_norm": 3.64375901222229,
203
+ "learning_rate": 4.55521472392638e-07,
204
+ "loss": -0.03970654606819153,
205
+ "num_tokens": 108664.0,
206
+ "reward": -0.3184999763965607,
207
+ "reward_std": 0.04019503518939018,
208
+ "rewards/format_reward/mean": 0.10499999970197678,
209
+ "rewards/format_reward/std": 0.13398344144225122,
210
+ "rewards/security_audit_reward/mean": -0.5,
211
+ "rewards/security_audit_reward/std": 0.0,
212
+ "step": 30,
213
+ "step_time": 38.56538706479987
214
+ },
215
+ {
216
+ "clip_ratio/high_max": 0.0,
217
+ "clip_ratio/high_mean": 0.0,
218
+ "clip_ratio/low_mean": 0.0,
219
+ "clip_ratio/low_min": 0.0,
220
+ "clip_ratio/region_mean": 0.0,
221
+ "completions/clipped_ratio": 0.4,
222
+ "completions/max_length": 504.2,
223
+ "completions/max_terminated_length": 454.6,
224
+ "completions/mean_length": 421.25,
225
+ "completions/mean_terminated_length": 386.0,
226
+ "completions/min_length": 328.6,
227
+ "completions/min_terminated_length": 328.6,
228
+ "entropy": 1.3522289156913758,
229
+ "epoch": 0.2147239263803681,
230
+ "frac_reward_zero_std": 0.3,
231
+ "grad_norm": 3.4385552406311035,
232
+ "learning_rate": 4.4785276073619634e-07,
233
+ "loss": -0.06348788738250732,
234
+ "num_tokens": 126953.0,
235
+ "reward": -0.32824997901916503,
236
+ "reward_std": 0.03220053892582655,
237
+ "rewards/format_reward/mean": 0.07250000201165677,
238
+ "rewards/format_reward/std": 0.10733511671423912,
239
+ "rewards/security_audit_reward/mean": -0.5,
240
+ "rewards/security_audit_reward/std": 0.0,
241
+ "step": 35,
242
+ "step_time": 37.87626404739986
243
+ },
244
+ {
245
+ "clip_ratio/high_max": 0.0,
246
+ "clip_ratio/high_mean": 0.0,
247
+ "clip_ratio/low_mean": 0.0,
248
+ "clip_ratio/low_min": 0.0,
249
+ "clip_ratio/region_mean": 0.0,
250
+ "completions/clipped_ratio": 0.55,
251
+ "completions/max_length": 505.2,
252
+ "completions/max_terminated_length": 334.4,
253
+ "completions/mean_length": 414.9,
254
+ "completions/mean_terminated_length": 240.3,
255
+ "completions/min_length": 243.0,
256
+ "completions/min_terminated_length": 140.6,
257
+ "entropy": 1.230024951696396,
258
+ "epoch": 0.24539877300613497,
259
+ "frac_reward_zero_std": 0.3,
260
+ "grad_norm": 4.400479793548584,
261
+ "learning_rate": 4.401840490797546e-07,
262
+ "loss": 0.11927952766418456,
263
+ "num_tokens": 144785.0,
264
+ "reward": -0.2897499829530716,
265
+ "reward_std": 0.12973095811903476,
266
+ "rewards/format_reward/mean": 0.14250000044703484,
267
+ "rewards/format_reward/std": 0.14365934804081917,
268
+ "rewards/security_audit_reward/mean": -0.475,
269
+ "rewards/security_audit_reward/std": 0.13164966106414794,
270
+ "step": 40,
271
+ "step_time": 37.8069536416001
272
+ },
273
+ {
274
+ "clip_ratio/high_max": 0.0,
275
+ "clip_ratio/high_mean": 0.0,
276
+ "clip_ratio/low_mean": 0.0,
277
+ "clip_ratio/low_min": 0.0,
278
+ "clip_ratio/region_mean": 0.0,
279
+ "completions/clipped_ratio": 0.6,
280
+ "completions/max_length": 472.2,
281
+ "completions/max_terminated_length": 190.6,
282
+ "completions/mean_length": 412.7,
283
+ "completions/mean_terminated_length": 160.85,
284
+ "completions/min_length": 333.8,
285
+ "completions/min_terminated_length": 129.0,
286
+ "entropy": 1.2133947968482972,
287
+ "epoch": 0.27607361963190186,
288
+ "frac_reward_zero_std": 0.1,
289
+ "grad_norm": 4.325937271118164,
290
+ "learning_rate": 4.3251533742331285e-07,
291
+ "loss": 0.025146520137786864,
292
+ "num_tokens": 162443.0,
293
+ "reward": -0.1574999876320362,
294
+ "reward_std": 0.2636621415615082,
295
+ "rewards/format_reward/mean": 0.24500001072883607,
296
+ "rewards/format_reward/std": 0.23762110471725464,
297
+ "rewards/security_audit_reward/mean": -0.32999999523162843,
298
+ "rewards/security_audit_reward/std": 0.28574271202087403,
299
+ "step": 45,
300
+ "step_time": 34.90182834920015
301
+ },
302
+ {
303
+ "clip_ratio/high_max": 0.0,
304
+ "clip_ratio/high_mean": 0.0,
305
+ "clip_ratio/low_mean": 0.0,
306
+ "clip_ratio/low_min": 0.0,
307
+ "clip_ratio/region_mean": 0.0,
308
+ "completions/clipped_ratio": 0.55,
309
+ "completions/max_length": 512.0,
310
+ "completions/max_terminated_length": 308.6,
311
+ "completions/mean_length": 397.05,
312
+ "completions/mean_terminated_length": 259.2,
313
+ "completions/min_length": 203.0,
314
+ "completions/min_terminated_length": 203.0,
315
+ "entropy": 1.4294291973114013,
316
+ "epoch": 0.3067484662576687,
317
+ "frac_reward_zero_std": 0.4,
318
+ "grad_norm": 3.9505743980407715,
319
+ "learning_rate": 4.2484662576687116e-07,
320
+ "loss": -0.08058007955551147,
321
+ "num_tokens": 180200.0,
322
+ "reward": -0.29249998927116394,
323
+ "reward_std": 0.10127481501549482,
324
+ "rewards/format_reward/mean": 0.0750000026077032,
325
+ "rewards/format_reward/std": 0.1127780631184578,
326
+ "rewards/security_audit_reward/mean": -0.45,
327
+ "rewards/security_audit_reward/std": 0.1,
328
+ "step": 50,
329
+ "step_time": 38.750808009400046
330
+ },
331
+ {
332
+ "clip_ratio/high_max": 0.0,
333
+ "clip_ratio/high_mean": 0.0,
334
+ "clip_ratio/low_mean": 0.0,
335
+ "clip_ratio/low_min": 0.0,
336
+ "clip_ratio/region_mean": 0.0,
337
+ "completions/clipped_ratio": 0.5,
338
+ "completions/max_length": 512.0,
339
+ "completions/max_terminated_length": 389.8,
340
+ "completions/mean_length": 404.65,
341
+ "completions/mean_terminated_length": 297.43333740234374,
342
+ "completions/min_length": 192.8,
343
+ "completions/min_terminated_length": 192.8,
344
+ "entropy": 1.2564165532588958,
345
+ "epoch": 0.3374233128834356,
346
+ "frac_reward_zero_std": 0.3,
347
+ "grad_norm": 3.3762269020080566,
348
+ "learning_rate": 4.171779141104294e-07,
349
+ "loss": -0.030467823147773743,
350
+ "num_tokens": 198109.0,
351
+ "reward": -0.2542499825358391,
352
+ "reward_std": 0.07489922866225243,
353
+ "rewards/format_reward/mean": 0.20250000841915608,
354
+ "rewards/format_reward/std": 0.1368803471326828,
355
+ "rewards/security_audit_reward/mean": -0.45,
356
+ "rewards/security_audit_reward/std": 0.05773502588272095,
357
+ "step": 55,
358
+ "step_time": 38.411275500399825
359
+ },
360
+ {
361
+ "clip_ratio/high_max": 0.0,
362
+ "clip_ratio/high_mean": 0.0,
363
+ "clip_ratio/low_mean": 0.0,
364
+ "clip_ratio/low_min": 0.0,
365
+ "clip_ratio/region_mean": 0.0,
366
+ "completions/clipped_ratio": 0.35,
367
+ "completions/max_length": 496.2,
368
+ "completions/max_terminated_length": 448.8,
369
+ "completions/mean_length": 394.3,
370
+ "completions/mean_terminated_length": 358.6166687011719,
371
+ "completions/min_length": 285.4,
372
+ "completions/min_terminated_length": 285.4,
373
+ "entropy": 1.2620218694210052,
374
+ "epoch": 0.36809815950920244,
375
+ "frac_reward_zero_std": 0.2,
376
+ "grad_norm": 2.8227944374084473,
377
+ "learning_rate": 4.095092024539877e-07,
378
+ "loss": 0.039707571268081665,
379
+ "num_tokens": 215747.0,
380
+ "reward": -0.2729999750852585,
381
+ "reward_std": 0.13599938787519933,
382
+ "rewards/format_reward/mean": 0.14000000432133675,
383
+ "rewards/format_reward/std": 0.14343783408403396,
384
+ "rewards/security_audit_reward/mean": -0.45,
385
+ "rewards/security_audit_reward/std": 0.1393846869468689,
386
+ "step": 60,
387
+ "step_time": 37.50645367139987
388
+ },
389
+ {
390
+ "clip_ratio/high_max": 0.0,
391
+ "clip_ratio/high_mean": 0.0,
392
+ "clip_ratio/low_mean": 0.0,
393
+ "clip_ratio/low_min": 0.0,
394
+ "clip_ratio/region_mean": 0.0,
395
+ "completions/clipped_ratio": 0.35,
396
+ "completions/max_length": 484.8,
397
+ "completions/max_terminated_length": 334.8,
398
+ "completions/mean_length": 381.35,
399
+ "completions/mean_terminated_length": 255.98333740234375,
400
+ "completions/min_length": 275.6,
401
+ "completions/min_terminated_length": 173.2,
402
+ "entropy": 1.2798833012580872,
403
+ "epoch": 0.3987730061349693,
404
+ "frac_reward_zero_std": 0.2,
405
+ "grad_norm": 3.4819753170013428,
406
+ "learning_rate": 4.01840490797546e-07,
407
+ "loss": -0.06275686025619506,
408
+ "num_tokens": 233162.0,
409
+ "reward": -0.201749986410141,
410
+ "reward_std": 0.2016347900032997,
411
+ "rewards/format_reward/mean": 0.20250000804662704,
412
+ "rewards/format_reward/std": 0.22137173414230346,
413
+ "rewards/security_audit_reward/mean": -0.375,
414
+ "rewards/security_audit_reward/std": 0.20773502588272094,
415
+ "step": 65,
416
+ "step_time": 37.139902984000216
417
+ },
418
+ {
419
+ "clip_ratio/high_max": 0.0,
420
+ "clip_ratio/high_mean": 0.0,
421
+ "clip_ratio/low_mean": 0.0,
422
+ "clip_ratio/low_min": 0.0,
423
+ "clip_ratio/region_mean": 0.0,
424
+ "completions/clipped_ratio": 0.6,
425
+ "completions/max_length": 512.0,
426
+ "completions/max_terminated_length": 289.8,
427
+ "completions/mean_length": 408.45,
428
+ "completions/mean_terminated_length": 251.7,
429
+ "completions/min_length": 207.0,
430
+ "completions/min_terminated_length": 207.0,
431
+ "entropy": 1.2134525895118713,
432
+ "epoch": 0.4294478527607362,
433
+ "frac_reward_zero_std": 0.3,
434
+ "grad_norm": 4.019806861877441,
435
+ "learning_rate": 3.941717791411043e-07,
436
+ "loss": 0.08099154829978943,
437
+ "num_tokens": 251321.0,
438
+ "reward": -0.27599998414516447,
439
+ "reward_std": 0.08945702444761991,
440
+ "rewards/format_reward/mean": 0.1300000037997961,
441
+ "rewards/format_reward/std": 0.1645726040005684,
442
+ "rewards/security_audit_reward/mean": -0.45,
443
+ "rewards/security_audit_reward/std": 0.05773502588272095,
444
+ "step": 70,
445
+ "step_time": 38.02127088899997
446
+ },
447
+ {
448
+ "clip_ratio/high_max": 0.0,
449
+ "clip_ratio/high_mean": 0.0,
450
+ "clip_ratio/low_mean": 0.0,
451
+ "clip_ratio/low_min": 0.0,
452
+ "clip_ratio/region_mean": 0.0,
453
+ "completions/clipped_ratio": 0.5,
454
+ "completions/max_length": 459.8,
455
+ "completions/max_terminated_length": 206.2,
456
+ "completions/mean_length": 348.9,
457
+ "completions/mean_terminated_length": 176.9,
458
+ "completions/min_length": 148.6,
459
+ "completions/min_terminated_length": 148.6,
460
+ "entropy": 1.3179432690143584,
461
+ "epoch": 0.4601226993865031,
462
+ "frac_reward_zero_std": 0.0,
463
+ "grad_norm": 6.28598690032959,
464
+ "learning_rate": 3.8650306748466255e-07,
465
+ "loss": -0.11171818971633911,
466
+ "num_tokens": 267725.0,
467
+ "reward": -0.19474998638033866,
468
+ "reward_std": 0.17031802013516426,
469
+ "rewards/format_reward/mean": 0.23750000447034836,
470
+ "rewards/format_reward/std": 0.17114628925919534,
471
+ "rewards/security_audit_reward/mean": -0.3800000011920929,
472
+ "rewards/security_audit_reward/std": 0.180902099609375,
473
+ "step": 75,
474
+ "step_time": 34.450158203000136
475
+ },
476
+ {
477
+ "clip_ratio/high_max": 0.0,
478
+ "clip_ratio/high_mean": 0.0,
479
+ "clip_ratio/low_mean": 0.0,
480
+ "clip_ratio/low_min": 0.0,
481
+ "clip_ratio/region_mean": 0.0,
482
+ "completions/clipped_ratio": 0.5,
483
+ "completions/max_length": 512.0,
484
+ "completions/max_terminated_length": 332.8,
485
+ "completions/mean_length": 400.75,
486
+ "completions/mean_terminated_length": 289.06666870117186,
487
+ "completions/min_length": 251.0,
488
+ "completions/min_terminated_length": 251.0,
489
+ "entropy": 1.1514661133289337,
490
+ "epoch": 0.49079754601226994,
491
+ "frac_reward_zero_std": 0.4,
492
+ "grad_norm": 2.8479247093200684,
493
+ "learning_rate": 3.788343558282208e-07,
494
+ "loss": 0.03145935535430908,
495
+ "num_tokens": 285726.0,
496
+ "reward": -0.2569999933242798,
497
+ "reward_std": 0.15333212018013,
498
+ "rewards/format_reward/mean": 0.1350000023841858,
499
+ "rewards/format_reward/std": 0.18636635541915894,
500
+ "rewards/security_audit_reward/mean": -0.425,
501
+ "rewards/security_audit_reward/std": 0.15,
502
+ "step": 80,
503
+ "step_time": 38.869779922999626
504
+ },
505
+ {
506
+ "clip_ratio/high_max": 0.0,
507
+ "clip_ratio/high_mean": 0.0,
508
+ "clip_ratio/low_mean": 0.0,
509
+ "clip_ratio/low_min": 0.0,
510
+ "clip_ratio/region_mean": 0.0,
511
+ "completions/clipped_ratio": 0.5,
512
+ "completions/max_length": 512.0,
513
+ "completions/max_terminated_length": 321.2,
514
+ "completions/mean_length": 398.5,
515
+ "completions/mean_terminated_length": 253.0,
516
+ "completions/min_length": 183.0,
517
+ "completions/min_terminated_length": 183.0,
518
+ "entropy": 1.244500571489334,
519
+ "epoch": 0.5214723926380368,
520
+ "frac_reward_zero_std": 0.3,
521
+ "grad_norm": 2.146970272064209,
522
+ "learning_rate": 3.7116564417177916e-07,
523
+ "loss": 0.06171210408210755,
524
+ "num_tokens": 304148.0,
525
+ "reward": -0.22524999380111693,
526
+ "reward_std": 0.19191497713327407,
527
+ "rewards/format_reward/mean": 0.18250000327825547,
528
+ "rewards/format_reward/std": 0.19969657957553863,
529
+ "rewards/security_audit_reward/mean": -0.4,
530
+ "rewards/security_audit_reward/std": 0.2,
531
+ "step": 85,
532
+ "step_time": 39.297288996000134
533
+ },
534
+ {
535
+ "clip_ratio/high_max": 0.0,
536
+ "clip_ratio/high_mean": 0.0,
537
+ "clip_ratio/low_mean": 0.0,
538
+ "clip_ratio/low_min": 0.0,
539
+ "clip_ratio/region_mean": 0.0,
540
+ "completions/clipped_ratio": 0.55,
541
+ "completions/max_length": 512.0,
542
+ "completions/max_terminated_length": 360.0,
543
+ "completions/mean_length": 419.5,
544
+ "completions/mean_terminated_length": 325.6,
545
+ "completions/min_length": 291.2,
546
+ "completions/min_terminated_length": 291.2,
547
+ "entropy": 1.206581747531891,
548
+ "epoch": 0.5521472392638037,
549
+ "frac_reward_zero_std": 0.2,
550
+ "grad_norm": 2.1158456802368164,
551
+ "learning_rate": 3.634969325153374e-07,
552
+ "loss": -0.06664568185806274,
553
+ "num_tokens": 321680.0,
554
+ "reward": -0.23324998915195466,
555
+ "reward_std": 0.1919491995126009,
556
+ "rewards/format_reward/mean": 0.1675000049173832,
557
+ "rewards/format_reward/std": 0.19863576367497443,
558
+ "rewards/security_audit_reward/mean": -0.40499999523162844,
559
+ "rewards/security_audit_reward/std": 0.1899999976158142,
560
+ "step": 90,
561
+ "step_time": 38.49561594039933
562
+ },
563
+ {
564
+ "clip_ratio/high_max": 0.0,
565
+ "clip_ratio/high_mean": 0.0,
566
+ "clip_ratio/low_mean": 0.0,
567
+ "clip_ratio/low_min": 0.0,
568
+ "clip_ratio/region_mean": 0.0,
569
+ "completions/clipped_ratio": 0.4,
570
+ "completions/max_length": 469.2,
571
+ "completions/max_terminated_length": 364.0,
572
+ "completions/mean_length": 383.2,
573
+ "completions/mean_terminated_length": 291.8333374023438,
574
+ "completions/min_length": 217.6,
575
+ "completions/min_terminated_length": 217.6,
576
+ "entropy": 1.217250692844391,
577
+ "epoch": 0.5828220858895705,
578
+ "frac_reward_zero_std": 0.4,
579
+ "grad_norm": 4.098232269287109,
580
+ "learning_rate": 3.558282208588957e-07,
581
+ "loss": 0.05211906433105469,
582
+ "num_tokens": 339350.0,
583
+ "reward": -0.2119999945163727,
584
+ "reward_std": 0.19894140996038914,
585
+ "rewards/format_reward/mean": 0.1799999989569187,
586
+ "rewards/format_reward/std": 0.23350853994488716,
587
+ "rewards/security_audit_reward/mean": -0.37999999821186065,
588
+ "rewards/security_audit_reward/std": 0.1911805212497711,
589
+ "step": 95,
590
+ "step_time": 35.83687614579994
591
+ },
592
+ {
593
+ "clip_ratio/high_max": 0.0,
594
+ "clip_ratio/high_mean": 0.0,
595
+ "clip_ratio/low_mean": 0.0,
596
+ "clip_ratio/low_min": 0.0,
597
+ "clip_ratio/region_mean": 0.0,
598
+ "completions/clipped_ratio": 0.5,
599
+ "completions/max_length": 512.0,
600
+ "completions/max_terminated_length": 345.4,
601
+ "completions/mean_length": 377.45,
602
+ "completions/mean_terminated_length": 290.6333343505859,
603
+ "completions/min_length": 240.4,
604
+ "completions/min_terminated_length": 240.4,
605
+ "entropy": 1.2783292949199676,
606
+ "epoch": 0.6134969325153374,
607
+ "frac_reward_zero_std": 0.3,
608
+ "grad_norm": 2.361516237258911,
609
+ "learning_rate": 3.48159509202454e-07,
610
+ "loss": 0.06258203387260437,
611
+ "num_tokens": 356239.0,
612
+ "reward": -0.20649999231100083,
613
+ "reward_std": 0.18195689767599105,
614
+ "rewards/format_reward/mean": 0.24499999433755876,
615
+ "rewards/format_reward/std": 0.19310407042503358,
616
+ "rewards/security_audit_reward/mean": -0.4,
617
+ "rewards/security_audit_reward/std": 0.2,
618
+ "step": 100,
619
+ "step_time": 38.47846096040011
620
+ },
621
+ {
622
+ "clip_ratio/high_max": 0.0,
623
+ "clip_ratio/high_mean": 0.0,
624
+ "clip_ratio/low_mean": 0.0,
625
+ "clip_ratio/low_min": 0.0,
626
+ "clip_ratio/region_mean": 0.0,
627
+ "completions/clipped_ratio": 0.3,
628
+ "completions/max_length": 463.2,
629
+ "completions/max_terminated_length": 394.6,
630
+ "completions/mean_length": 335.8,
631
+ "completions/mean_terminated_length": 268.3,
632
+ "completions/min_length": 139.6,
633
+ "completions/min_terminated_length": 139.6,
634
+ "entropy": 1.2529696226119995,
635
+ "epoch": 0.6441717791411042,
636
+ "frac_reward_zero_std": 0.2,
637
+ "grad_norm": 3.356074094772339,
638
+ "learning_rate": 3.4049079754601224e-07,
639
+ "loss": 0.003340443968772888,
640
+ "num_tokens": 373237.0,
641
+ "reward": -0.2567499876022339,
642
+ "reward_std": 0.27417250275611876,
643
+ "rewards/format_reward/mean": 0.14750000461935997,
644
+ "rewards/format_reward/std": 0.19759280756115913,
645
+ "rewards/security_audit_reward/mean": -0.4299999952316284,
646
+ "rewards/security_audit_reward/std": 0.3186576545238495,
647
+ "step": 105,
648
+ "step_time": 35.43083410320014
649
+ },
650
+ {
651
+ "clip_ratio/high_max": 0.0,
652
+ "clip_ratio/high_mean": 0.0,
653
+ "clip_ratio/low_mean": 0.0,
654
+ "clip_ratio/low_min": 0.0,
655
+ "clip_ratio/region_mean": 0.0,
656
+ "completions/clipped_ratio": 0.45,
657
+ "completions/max_length": 512.0,
658
+ "completions/max_terminated_length": 372.8,
659
+ "completions/mean_length": 419.15,
660
+ "completions/mean_terminated_length": 319.2666748046875,
661
+ "completions/min_length": 266.8,
662
+ "completions/min_terminated_length": 266.8,
663
+ "entropy": 1.1685741186141967,
664
+ "epoch": 0.6748466257668712,
665
+ "frac_reward_zero_std": 0.2,
666
+ "grad_norm": 4.318619728088379,
667
+ "learning_rate": 3.3282208588957055e-07,
668
+ "loss": -0.026089027523994446,
669
+ "num_tokens": 391784.0,
670
+ "reward": -0.2662499874830246,
671
+ "reward_std": 0.07884115856140853,
672
+ "rewards/format_reward/mean": 0.16250000558793545,
673
+ "rewards/format_reward/std": 0.14070439487695693,
674
+ "rewards/security_audit_reward/mean": -0.45,
675
+ "rewards/security_audit_reward/std": 0.05773502588272095,
676
+ "step": 110,
677
+ "step_time": 38.89281254739999
678
+ },
679
+ {
680
+ "clip_ratio/high_max": 0.0,
681
+ "clip_ratio/high_mean": 0.0,
682
+ "clip_ratio/low_mean": 0.0,
683
+ "clip_ratio/low_min": 0.0,
684
+ "clip_ratio/region_mean": 0.0,
685
+ "completions/clipped_ratio": 0.3,
686
+ "completions/max_length": 468.0,
687
+ "completions/max_terminated_length": 337.6,
688
+ "completions/mean_length": 330.1,
689
+ "completions/mean_terminated_length": 240.4166687011719,
690
+ "completions/min_length": 160.8,
691
+ "completions/min_terminated_length": 160.8,
692
+ "entropy": 1.2954379856586455,
693
+ "epoch": 0.7055214723926381,
694
+ "frac_reward_zero_std": 0.2,
695
+ "grad_norm": 3.293928384780884,
696
+ "learning_rate": 3.251533742331288e-07,
697
+ "loss": 0.17276796102523803,
698
+ "num_tokens": 408446.0,
699
+ "reward": -0.22849999666213988,
700
+ "reward_std": 0.1390242099761963,
701
+ "rewards/format_reward/mean": 0.2300000011920929,
702
+ "rewards/format_reward/std": 0.23302415013313293,
703
+ "rewards/security_audit_reward/mean": -0.425,
704
+ "rewards/security_audit_reward/std": 0.10773502588272095,
705
+ "step": 115,
706
+ "step_time": 35.73082293679981
707
+ },
708
+ {
709
+ "clip_ratio/high_max": 0.0,
710
+ "clip_ratio/high_mean": 0.0,
711
+ "clip_ratio/low_mean": 0.0,
712
+ "clip_ratio/low_min": 0.0,
713
+ "clip_ratio/region_mean": 0.0,
714
+ "completions/clipped_ratio": 0.55,
715
+ "completions/max_length": 512.0,
716
+ "completions/max_terminated_length": 281.2,
717
+ "completions/mean_length": 400.5,
718
+ "completions/mean_terminated_length": 248.73333740234375,
719
+ "completions/min_length": 211.8,
720
+ "completions/min_terminated_length": 211.8,
721
+ "entropy": 1.2283548831939697,
722
+ "epoch": 0.7361963190184049,
723
+ "frac_reward_zero_std": 0.2,
724
+ "grad_norm": 2.632479190826416,
725
+ "learning_rate": 3.174846625766871e-07,
726
+ "loss": 0.05111231803894043,
727
+ "num_tokens": 426822.0,
728
+ "reward": -0.22074998915195465,
729
+ "reward_std": 0.15957241374999284,
730
+ "rewards/format_reward/mean": 0.1975000012665987,
731
+ "rewards/format_reward/std": 0.1847505249083042,
732
+ "rewards/security_audit_reward/mean": -0.4,
733
+ "rewards/security_audit_reward/std": 0.15773502588272095,
734
+ "step": 120,
735
+ "step_time": 39.14503100519996
736
+ },
737
+ {
738
+ "clip_ratio/high_max": 0.0,
739
+ "clip_ratio/high_mean": 0.0,
740
+ "clip_ratio/low_mean": 0.0,
741
+ "clip_ratio/low_min": 0.0,
742
+ "clip_ratio/region_mean": 0.0,
743
+ "completions/clipped_ratio": 0.5,
744
+ "completions/max_length": 512.0,
745
+ "completions/max_terminated_length": 340.2,
746
+ "completions/mean_length": 401.4,
747
+ "completions/mean_terminated_length": 236.03333740234376,
748
+ "completions/min_length": 245.2,
749
+ "completions/min_terminated_length": 142.8,
750
+ "entropy": 1.307636547088623,
751
+ "epoch": 0.7668711656441718,
752
+ "frac_reward_zero_std": 0.3,
753
+ "grad_norm": 5.566491603851318,
754
+ "learning_rate": 3.0981595092024537e-07,
755
+ "loss": 0.003215853124856949,
756
+ "num_tokens": 444322.0,
757
+ "reward": -0.11199999079108239,
758
+ "reward_std": 0.2506739288568497,
759
+ "rewards/format_reward/mean": 0.2449999988079071,
760
+ "rewards/format_reward/std": 0.20622505843639374,
761
+ "rewards/security_audit_reward/mean": -0.26500000059604645,
762
+ "rewards/security_audit_reward/std": 0.278915548324585,
763
+ "step": 125,
764
+ "step_time": 38.70229864360026
765
+ },
766
+ {
767
+ "clip_ratio/high_max": 0.0,
768
+ "clip_ratio/high_mean": 0.0,
769
+ "clip_ratio/low_mean": 0.0,
770
+ "clip_ratio/low_min": 0.0,
771
+ "clip_ratio/region_mean": 0.0,
772
+ "completions/clipped_ratio": 0.35,
773
+ "completions/max_length": 512.0,
774
+ "completions/max_terminated_length": 390.6,
775
+ "completions/mean_length": 367.0,
776
+ "completions/mean_terminated_length": 264.0000061035156,
777
+ "completions/min_length": 123.2,
778
+ "completions/min_terminated_length": 123.2,
779
+ "entropy": 1.248900693655014,
780
+ "epoch": 0.7975460122699386,
781
+ "frac_reward_zero_std": 0.2,
782
+ "grad_norm": 4.395384311676025,
783
+ "learning_rate": 3.021472392638036e-07,
784
+ "loss": 0.06482647061347961,
785
+ "num_tokens": 461894.0,
786
+ "reward": -0.2042499899864197,
787
+ "reward_std": 0.17003463432192803,
788
+ "rewards/format_reward/mean": 0.2525000125169754,
789
+ "rewards/format_reward/std": 0.21560870110988617,
790
+ "rewards/security_audit_reward/mean": -0.4,
791
+ "rewards/security_audit_reward/std": 0.15773502588272095,
792
+ "step": 130,
793
+ "step_time": 39.174271353800215
794
+ },
795
+ {
796
+ "clip_ratio/high_max": 0.0,
797
+ "clip_ratio/high_mean": 0.0,
798
+ "clip_ratio/low_mean": 0.0,
799
+ "clip_ratio/low_min": 0.0,
800
+ "clip_ratio/region_mean": 0.0,
801
+ "completions/clipped_ratio": 0.35,
802
+ "completions/max_length": 512.0,
803
+ "completions/max_terminated_length": 370.0,
804
+ "completions/mean_length": 359.95,
805
+ "completions/mean_terminated_length": 276.9666687011719,
806
+ "completions/min_length": 203.8,
807
+ "completions/min_terminated_length": 203.8,
808
+ "entropy": 1.3299469709396363,
809
+ "epoch": 0.8282208588957055,
810
+ "frac_reward_zero_std": 0.0,
811
+ "grad_norm": 4.3037519454956055,
812
+ "learning_rate": 2.94478527607362e-07,
813
+ "loss": 0.026153716444969177,
814
+ "num_tokens": 478783.0,
815
+ "reward": -0.19949999153614045,
816
+ "reward_std": 0.15296672135591508,
817
+ "rewards/format_reward/mean": 0.24500000178813935,
818
+ "rewards/format_reward/std": 0.2273508906364441,
819
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
820
+ "rewards/security_audit_reward/std": 0.12891554832458496,
821
+ "step": 135,
822
+ "step_time": 38.658669441000164
823
+ },
824
+ {
825
+ "clip_ratio/high_max": 0.0,
826
+ "clip_ratio/high_mean": 0.0,
827
+ "clip_ratio/low_mean": 0.0,
828
+ "clip_ratio/low_min": 0.0,
829
+ "clip_ratio/region_mean": 0.0,
830
+ "completions/clipped_ratio": 0.45,
831
+ "completions/max_length": 512.0,
832
+ "completions/max_terminated_length": 354.0,
833
+ "completions/mean_length": 380.3,
834
+ "completions/mean_terminated_length": 271.6666687011719,
835
+ "completions/min_length": 206.4,
836
+ "completions/min_terminated_length": 206.4,
837
+ "entropy": 1.0997539341449738,
838
+ "epoch": 0.8588957055214724,
839
+ "frac_reward_zero_std": 0.2,
840
+ "grad_norm": 2.3693976402282715,
841
+ "learning_rate": 2.8680981595092024e-07,
842
+ "loss": -0.01876506209373474,
843
+ "num_tokens": 496243.0,
844
+ "reward": -0.17974998727440833,
845
+ "reward_std": 0.18710523881018162,
846
+ "rewards/format_reward/mean": 0.21750000193715097,
847
+ "rewards/format_reward/std": 0.1843859799206257,
848
+ "rewards/security_audit_reward/mean": -0.35,
849
+ "rewards/security_audit_reward/std": 0.20347774028778076,
850
+ "step": 140,
851
+ "step_time": 39.171343391999834
852
+ },
853
+ {
854
+ "clip_ratio/high_max": 0.0,
855
+ "clip_ratio/high_mean": 0.0,
856
+ "clip_ratio/low_mean": 0.0,
857
+ "clip_ratio/low_min": 0.0,
858
+ "clip_ratio/region_mean": 0.0,
859
+ "completions/clipped_ratio": 0.35,
860
+ "completions/max_length": 456.2,
861
+ "completions/max_terminated_length": 355.2,
862
+ "completions/mean_length": 348.35,
863
+ "completions/mean_terminated_length": 270.3,
864
+ "completions/min_length": 164.8,
865
+ "completions/min_terminated_length": 164.8,
866
+ "entropy": 1.2017314374446868,
867
+ "epoch": 0.8895705521472392,
868
+ "frac_reward_zero_std": 0.3,
869
+ "grad_norm": 4.230531215667725,
870
+ "learning_rate": 2.791411042944785e-07,
871
+ "loss": 0.0029310762882232668,
872
+ "num_tokens": 513422.0,
873
+ "reward": -0.1637499898672104,
874
+ "reward_std": 0.2463478922843933,
875
+ "rewards/format_reward/mean": 0.2475000023841858,
876
+ "rewards/format_reward/std": 0.2085829883813858,
877
+ "rewards/security_audit_reward/mean": -0.34000000059604646,
878
+ "rewards/security_audit_reward/std": 0.26830023527145386,
879
+ "step": 145,
880
+ "step_time": 34.95672115479992
881
+ },
882
+ {
883
+ "clip_ratio/high_max": 0.0,
884
+ "clip_ratio/high_mean": 0.0,
885
+ "clip_ratio/low_mean": 0.0,
886
+ "clip_ratio/low_min": 0.0,
887
+ "clip_ratio/region_mean": 0.0,
888
+ "completions/clipped_ratio": 0.25,
889
+ "completions/max_length": 482.4,
890
+ "completions/max_terminated_length": 431.6,
891
+ "completions/mean_length": 355.35,
892
+ "completions/mean_terminated_length": 319.4166687011719,
893
+ "completions/min_length": 212.0,
894
+ "completions/min_terminated_length": 212.0,
895
+ "entropy": 1.258862280845642,
896
+ "epoch": 0.9202453987730062,
897
+ "frac_reward_zero_std": 0.2,
898
+ "grad_norm": 6.071740627288818,
899
+ "learning_rate": 2.714723926380368e-07,
900
+ "loss": 0.0822126567363739,
901
+ "num_tokens": 530643.0,
902
+ "reward": -0.1912499874830246,
903
+ "reward_std": 0.1670845106244087,
904
+ "rewards/format_reward/mean": 0.27249999940395353,
905
+ "rewards/format_reward/std": 0.1733592666685581,
906
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
907
+ "rewards/security_audit_reward/std": 0.17118052244186402,
908
+ "step": 150,
909
+ "step_time": 37.038782767599876
910
+ },
911
+ {
912
+ "clip_ratio/high_max": 0.0,
913
+ "clip_ratio/high_mean": 0.0,
914
+ "clip_ratio/low_mean": 0.0,
915
+ "clip_ratio/low_min": 0.0,
916
+ "clip_ratio/region_mean": 0.0,
917
+ "completions/clipped_ratio": 0.2,
918
+ "completions/max_length": 442.2,
919
+ "completions/max_terminated_length": 316.0,
920
+ "completions/mean_length": 283.75,
921
+ "completions/mean_terminated_length": 237.85000305175782,
922
+ "completions/min_length": 166.2,
923
+ "completions/min_terminated_length": 166.2,
924
+ "entropy": 1.4853489220142364,
925
+ "epoch": 0.950920245398773,
926
+ "frac_reward_zero_std": 0.0,
927
+ "grad_norm": 5.2570695877075195,
928
+ "learning_rate": 2.6380368098159506e-07,
929
+ "loss": 0.11004064083099366,
930
+ "num_tokens": 545966.0,
931
+ "reward": -0.15949999541044235,
932
+ "reward_std": 0.19192611873149873,
933
+ "rewards/format_reward/mean": 0.28500000238418577,
934
+ "rewards/format_reward/std": 0.22434256076812745,
935
+ "rewards/security_audit_reward/mean": -0.35,
936
+ "rewards/security_audit_reward/std": 0.2,
937
+ "step": 155,
938
+ "step_time": 33.86747411140077
939
+ },
940
+ {
941
+ "clip_ratio/high_max": 0.0,
942
+ "clip_ratio/high_mean": 0.0,
943
+ "clip_ratio/low_mean": 0.0,
944
+ "clip_ratio/low_min": 0.0,
945
+ "clip_ratio/region_mean": 0.0,
946
+ "completions/clipped_ratio": 0.55,
947
+ "completions/max_length": 512.0,
948
+ "completions/max_terminated_length": 340.8,
949
+ "completions/mean_length": 411.45,
950
+ "completions/mean_terminated_length": 265.9666687011719,
951
+ "completions/min_length": 192.0,
952
+ "completions/min_terminated_length": 192.0,
953
+ "entropy": 1.0828768193721772,
954
+ "epoch": 0.9815950920245399,
955
+ "frac_reward_zero_std": 0.2,
956
+ "grad_norm": 2.6537587642669678,
957
+ "learning_rate": 2.5613496932515337e-07,
958
+ "loss": 0.03556116819381714,
959
+ "num_tokens": 563683.0,
960
+ "reward": -0.20099999010562897,
961
+ "reward_std": 0.1888158166781068,
962
+ "rewards/format_reward/mean": 0.24000000059604645,
963
+ "rewards/format_reward/std": 0.16870398968458175,
964
+ "rewards/security_audit_reward/mean": -0.3899999976158142,
965
+ "rewards/security_audit_reward/std": 0.2199999988079071,
966
+ "step": 160,
967
+ "step_time": 38.665304075799575
968
+ },
969
+ {
970
+ "clip_ratio/high_max": 0.0,
971
+ "clip_ratio/high_mean": 0.0,
972
+ "clip_ratio/low_mean": 0.0,
973
+ "clip_ratio/low_min": 0.0,
974
+ "clip_ratio/region_mean": 0.0,
975
+ "completions/clipped_ratio": 0.3,
976
+ "completions/max_length": 483.2,
977
+ "completions/max_terminated_length": 393.2,
978
+ "completions/mean_length": 359.2,
979
+ "completions/mean_terminated_length": 303.3666687011719,
980
+ "completions/min_length": 178.2,
981
+ "completions/min_terminated_length": 178.2,
982
+ "entropy": 1.1811485469341279,
983
+ "epoch": 1.0122699386503067,
984
+ "frac_reward_zero_std": 0.0,
985
+ "grad_norm": 4.203860282897949,
986
+ "learning_rate": 2.4846625766871163e-07,
987
+ "loss": -0.02532302737236023,
988
+ "num_tokens": 580183.0,
989
+ "reward": -0.1227499857544899,
990
+ "reward_std": 0.2651766210794449,
991
+ "rewards/format_reward/mean": 0.2675000011920929,
992
+ "rewards/format_reward/std": 0.24115291833877564,
993
+ "rewards/security_audit_reward/mean": -0.2899999976158142,
994
+ "rewards/security_audit_reward/std": 0.2812127649784088,
995
+ "step": 165,
996
+ "step_time": 36.34888075860035
997
+ },
998
+ {
999
+ "clip_ratio/high_max": 0.0,
1000
+ "clip_ratio/high_mean": 0.0,
1001
+ "clip_ratio/low_mean": 0.0,
1002
+ "clip_ratio/low_min": 0.0,
1003
+ "clip_ratio/region_mean": 0.0,
1004
+ "completions/clipped_ratio": 0.4,
1005
+ "completions/max_length": 512.0,
1006
+ "completions/max_terminated_length": 345.8,
1007
+ "completions/mean_length": 358.5,
1008
+ "completions/mean_terminated_length": 235.96666870117187,
1009
+ "completions/min_length": 123.0,
1010
+ "completions/min_terminated_length": 123.0,
1011
+ "entropy": 1.2863860994577407,
1012
+ "epoch": 1.0429447852760736,
1013
+ "frac_reward_zero_std": 0.2,
1014
+ "grad_norm": 2.71185302734375,
1015
+ "learning_rate": 2.4079754601226994e-07,
1016
+ "loss": 0.12254136800765991,
1017
+ "num_tokens": 597345.0,
1018
+ "reward": -0.20274999886751174,
1019
+ "reward_std": 0.1825057201087475,
1020
+ "rewards/format_reward/mean": 0.25750000327825545,
1021
+ "rewards/format_reward/std": 0.19183385372161865,
1022
+ "rewards/security_audit_reward/mean": -0.4,
1023
+ "rewards/security_audit_reward/std": 0.19711971282958984,
1024
+ "step": 170,
1025
+ "step_time": 38.96664929399922
1026
+ },
1027
+ {
1028
+ "clip_ratio/high_max": 0.0,
1029
+ "clip_ratio/high_mean": 0.0,
1030
+ "clip_ratio/low_mean": 0.0,
1031
+ "clip_ratio/low_min": 0.0,
1032
+ "clip_ratio/region_mean": 0.0,
1033
+ "completions/clipped_ratio": 0.5,
1034
+ "completions/max_length": 512.0,
1035
+ "completions/max_terminated_length": 323.4,
1036
+ "completions/mean_length": 407.1,
1037
+ "completions/mean_terminated_length": 234.40000610351564,
1038
+ "completions/min_length": 242.8,
1039
+ "completions/min_terminated_length": 140.4,
1040
+ "entropy": 1.179810070991516,
1041
+ "epoch": 1.0736196319018405,
1042
+ "frac_reward_zero_std": 0.2,
1043
+ "grad_norm": 4.413055419921875,
1044
+ "learning_rate": 2.331288343558282e-07,
1045
+ "loss": 0.041037318110466,
1046
+ "num_tokens": 615063.0,
1047
+ "reward": -0.20374999046325684,
1048
+ "reward_std": 0.2052689865231514,
1049
+ "rewards/format_reward/mean": 0.31250000894069674,
1050
+ "rewards/format_reward/std": 0.22553626000881194,
1051
+ "rewards/security_audit_reward/mean": -0.425,
1052
+ "rewards/security_audit_reward/std": 0.23164966106414794,
1053
+ "step": 175,
1054
+ "step_time": 39.00492364500023
1055
+ },
1056
+ {
1057
+ "clip_ratio/high_max": 0.0,
1058
+ "clip_ratio/high_mean": 0.0,
1059
+ "clip_ratio/low_mean": 0.0,
1060
+ "clip_ratio/low_min": 0.0,
1061
+ "clip_ratio/region_mean": 0.0,
1062
+ "completions/clipped_ratio": 0.3,
1063
+ "completions/max_length": 511.6,
1064
+ "completions/max_terminated_length": 424.0,
1065
+ "completions/mean_length": 399.45,
1066
+ "completions/mean_terminated_length": 347.9166748046875,
1067
+ "completions/min_length": 262.6,
1068
+ "completions/min_terminated_length": 262.6,
1069
+ "entropy": 1.1026120364665986,
1070
+ "epoch": 1.1042944785276074,
1071
+ "frac_reward_zero_std": 0.0,
1072
+ "grad_norm": 4.408414840698242,
1073
+ "learning_rate": 2.254601226993865e-07,
1074
+ "loss": 0.058766734600067136,
1075
+ "num_tokens": 632696.0,
1076
+ "reward": -0.16599998623132706,
1077
+ "reward_std": 0.26377752125263215,
1078
+ "rewards/format_reward/mean": 0.24000000655651094,
1079
+ "rewards/format_reward/std": 0.21778101623058319,
1080
+ "rewards/security_audit_reward/mean": -0.34000000059604646,
1081
+ "rewards/security_audit_reward/std": 0.3105652093887329,
1082
+ "step": 180,
1083
+ "step_time": 39.09616019519963
1084
+ },
1085
+ {
1086
+ "clip_ratio/high_max": 0.0,
1087
+ "clip_ratio/high_mean": 0.0,
1088
+ "clip_ratio/low_mean": 0.0,
1089
+ "clip_ratio/low_min": 0.0,
1090
+ "clip_ratio/region_mean": 0.0,
1091
+ "completions/clipped_ratio": 0.2,
1092
+ "completions/max_length": 477.4,
1093
+ "completions/max_terminated_length": 421.6,
1094
+ "completions/mean_length": 322.45,
1095
+ "completions/mean_terminated_length": 273.1000030517578,
1096
+ "completions/min_length": 156.0,
1097
+ "completions/min_terminated_length": 156.0,
1098
+ "entropy": 1.2888785600662231,
1099
+ "epoch": 1.1349693251533743,
1100
+ "frac_reward_zero_std": 0.0,
1101
+ "grad_norm": 3.6877431869506836,
1102
+ "learning_rate": 2.1779141104294476e-07,
1103
+ "loss": -0.0771723210811615,
1104
+ "num_tokens": 649353.0,
1105
+ "reward": -0.1799999952316284,
1106
+ "reward_std": 0.3134476348757744,
1107
+ "rewards/format_reward/mean": 0.2750000089406967,
1108
+ "rewards/format_reward/std": 0.24135999679565429,
1109
+ "rewards/security_audit_reward/mean": -0.375,
1110
+ "rewards/security_audit_reward/std": 0.3593961834907532,
1111
+ "step": 185,
1112
+ "step_time": 36.71157897000012
1113
+ },
1114
+ {
1115
+ "clip_ratio/high_max": 0.0,
1116
+ "clip_ratio/high_mean": 0.0,
1117
+ "clip_ratio/low_mean": 0.0,
1118
+ "clip_ratio/low_min": 0.0,
1119
+ "clip_ratio/region_mean": 0.0,
1120
+ "completions/clipped_ratio": 0.3,
1121
+ "completions/max_length": 495.0,
1122
+ "completions/max_terminated_length": 351.0,
1123
+ "completions/mean_length": 328.8,
1124
+ "completions/mean_terminated_length": 243.7,
1125
+ "completions/min_length": 164.8,
1126
+ "completions/min_terminated_length": 164.8,
1127
+ "entropy": 1.4585140287876128,
1128
+ "epoch": 1.165644171779141,
1129
+ "frac_reward_zero_std": 0.2,
1130
+ "grad_norm": 4.987306118011475,
1131
+ "learning_rate": 2.1012269938650307e-07,
1132
+ "loss": -0.15080010890960693,
1133
+ "num_tokens": 665513.0,
1134
+ "reward": -0.050499990582466125,
1135
+ "reward_std": 0.2684710592031479,
1136
+ "rewards/format_reward/mean": 0.31000000387430193,
1137
+ "rewards/format_reward/std": 0.1994625985622406,
1138
+ "rewards/security_audit_reward/mean": -0.20500000119209288,
1139
+ "rewards/security_audit_reward/std": 0.31255176067352297,
1140
+ "step": 190,
1141
+ "step_time": 37.63543628939988
1142
+ },
1143
+ {
1144
+ "clip_ratio/high_max": 0.0,
1145
+ "clip_ratio/high_mean": 0.0,
1146
+ "clip_ratio/low_mean": 0.0,
1147
+ "clip_ratio/low_min": 0.0,
1148
+ "clip_ratio/region_mean": 0.0,
1149
+ "completions/clipped_ratio": 0.5,
1150
+ "completions/max_length": 512.0,
1151
+ "completions/max_terminated_length": 307.8,
1152
+ "completions/mean_length": 391.75,
1153
+ "completions/mean_terminated_length": 253.06666870117186,
1154
+ "completions/min_length": 178.0,
1155
+ "completions/min_terminated_length": 178.0,
1156
+ "entropy": 1.1293343544006347,
1157
+ "epoch": 1.196319018404908,
1158
+ "frac_reward_zero_std": 0.1,
1159
+ "grad_norm": 5.247244358062744,
1160
+ "learning_rate": 2.0245398773006135e-07,
1161
+ "loss": -0.04229157567024231,
1162
+ "num_tokens": 683268.0,
1163
+ "reward": -0.10224998965859414,
1164
+ "reward_std": 0.19266743455082178,
1165
+ "rewards/format_reward/mean": 0.3124999929219484,
1166
+ "rewards/format_reward/std": 0.1390557773411274,
1167
+ "rewards/security_audit_reward/mean": -0.2800000011920929,
1168
+ "rewards/security_audit_reward/std": 0.23863712549209595,
1169
+ "step": 195,
1170
+ "step_time": 38.93936442300037
1171
+ },
1172
+ {
1173
+ "clip_ratio/high_max": 0.0,
1174
+ "clip_ratio/high_mean": 0.0,
1175
+ "clip_ratio/low_mean": 0.0,
1176
+ "clip_ratio/low_min": 0.0,
1177
+ "clip_ratio/region_mean": 0.0,
1178
+ "completions/clipped_ratio": 0.4,
1179
+ "completions/max_length": 485.0,
1180
+ "completions/max_terminated_length": 346.2,
1181
+ "completions/mean_length": 364.6,
1182
+ "completions/mean_terminated_length": 265.8166748046875,
1183
+ "completions/min_length": 165.8,
1184
+ "completions/min_terminated_length": 165.8,
1185
+ "entropy": 0.8287177711725235,
1186
+ "epoch": 1.2269938650306749,
1187
+ "frac_reward_zero_std": 0.1,
1188
+ "grad_norm": 2.4230945110321045,
1189
+ "learning_rate": 1.9478527607361963e-07,
1190
+ "loss": -0.05633368492126465,
1191
+ "num_tokens": 700760.0,
1192
+ "reward": -0.1807499848306179,
1193
+ "reward_std": 0.18529897555708885,
1194
+ "rewards/format_reward/mean": 0.3075000137090683,
1195
+ "rewards/format_reward/std": 0.15467575192451477,
1196
+ "rewards/security_audit_reward/mean": -0.39000000059604645,
1197
+ "rewards/security_audit_reward/std": 0.2105652093887329,
1198
+ "step": 200,
1199
+ "step_time": 37.065423558799736
1200
+ },
1201
+ {
1202
+ "clip_ratio/high_max": 0.0,
1203
+ "clip_ratio/high_mean": 0.0,
1204
+ "clip_ratio/low_mean": 0.0,
1205
+ "clip_ratio/low_min": 0.0,
1206
+ "clip_ratio/region_mean": 0.0,
1207
+ "completions/clipped_ratio": 0.25,
1208
+ "completions/max_length": 431.6,
1209
+ "completions/max_terminated_length": 338.8,
1210
+ "completions/mean_length": 284.1,
1211
+ "completions/mean_terminated_length": 225.65,
1212
+ "completions/min_length": 119.4,
1213
+ "completions/min_terminated_length": 119.4,
1214
+ "entropy": 1.2736368715763091,
1215
+ "epoch": 1.2576687116564418,
1216
+ "frac_reward_zero_std": 0.1,
1217
+ "grad_norm": 4.797567367553711,
1218
+ "learning_rate": 1.8711656441717791e-07,
1219
+ "loss": 0.08297693133354186,
1220
+ "num_tokens": 716344.0,
1221
+ "reward": -0.07274999544024467,
1222
+ "reward_std": 0.24350565671920776,
1223
+ "rewards/format_reward/mean": 0.31749999821186065,
1224
+ "rewards/format_reward/std": 0.19669782146811485,
1225
+ "rewards/security_audit_reward/mean": -0.23999999985098838,
1226
+ "rewards/security_audit_reward/std": 0.2692204549908638,
1227
+ "step": 205,
1228
+ "step_time": 33.11877055760014
1229
+ },
1230
+ {
1231
+ "clip_ratio/high_max": 0.0,
1232
+ "clip_ratio/high_mean": 0.0,
1233
+ "clip_ratio/low_mean": 0.0,
1234
+ "clip_ratio/low_min": 0.0,
1235
+ "clip_ratio/region_mean": 0.0,
1236
+ "completions/clipped_ratio": 0.25,
1237
+ "completions/max_length": 484.0,
1238
+ "completions/max_terminated_length": 403.2,
1239
+ "completions/mean_length": 335.45,
1240
+ "completions/mean_terminated_length": 276.4500030517578,
1241
+ "completions/min_length": 173.8,
1242
+ "completions/min_terminated_length": 173.8,
1243
+ "entropy": 1.1084223449230195,
1244
+ "epoch": 1.2883435582822087,
1245
+ "frac_reward_zero_std": 0.1,
1246
+ "grad_norm": 2.4603023529052734,
1247
+ "learning_rate": 1.7944785276073617e-07,
1248
+ "loss": 0.07945090532302856,
1249
+ "num_tokens": 733245.0,
1250
+ "reward": -0.13774999380111694,
1251
+ "reward_std": 0.2730386942625046,
1252
+ "rewards/format_reward/mean": 0.2174999989569187,
1253
+ "rewards/format_reward/std": 0.22229814901947975,
1254
+ "rewards/security_audit_reward/mean": -0.29000000059604647,
1255
+ "rewards/security_audit_reward/std": 0.3105652093887329,
1256
+ "step": 210,
1257
+ "step_time": 37.03591289120122
1258
+ },
1259
+ {
1260
+ "clip_ratio/high_max": 0.0,
1261
+ "clip_ratio/high_mean": 0.0,
1262
+ "clip_ratio/low_mean": 0.0,
1263
+ "clip_ratio/low_min": 0.0,
1264
+ "clip_ratio/region_mean": 0.0,
1265
+ "completions/clipped_ratio": 0.4,
1266
+ "completions/max_length": 512.0,
1267
+ "completions/max_terminated_length": 361.8,
1268
+ "completions/mean_length": 345.0,
1269
+ "completions/mean_terminated_length": 261.3000030517578,
1270
+ "completions/min_length": 184.8,
1271
+ "completions/min_terminated_length": 184.8,
1272
+ "entropy": 1.2274070978164673,
1273
+ "epoch": 1.3190184049079754,
1274
+ "frac_reward_zero_std": 0.0,
1275
+ "grad_norm": 4.573819637298584,
1276
+ "learning_rate": 1.7177914110429448e-07,
1277
+ "loss": -0.07497722506523133,
1278
+ "num_tokens": 749917.0,
1279
+ "reward": -0.01174999624490738,
1280
+ "reward_std": 0.3002330154180527,
1281
+ "rewards/format_reward/mean": 0.3225000023841858,
1282
+ "rewards/format_reward/std": 0.1751384623348713,
1283
+ "rewards/security_audit_reward/mean": -0.15499999821186067,
1284
+ "rewards/security_audit_reward/std": 0.3648489773273468,
1285
+ "step": 215,
1286
+ "step_time": 38.89012140319937
1287
+ },
1288
+ {
1289
+ "clip_ratio/high_max": 0.0,
1290
+ "clip_ratio/high_mean": 0.0,
1291
+ "clip_ratio/low_mean": 0.0,
1292
+ "clip_ratio/low_min": 0.0,
1293
+ "clip_ratio/region_mean": 0.0,
1294
+ "completions/clipped_ratio": 0.2,
1295
+ "completions/max_length": 481.8,
1296
+ "completions/max_terminated_length": 344.6,
1297
+ "completions/mean_length": 292.3,
1298
+ "completions/mean_terminated_length": 234.26666870117188,
1299
+ "completions/min_length": 122.2,
1300
+ "completions/min_terminated_length": 122.2,
1301
+ "entropy": 1.1711494624614716,
1302
+ "epoch": 1.3496932515337423,
1303
+ "frac_reward_zero_std": 0.1,
1304
+ "grad_norm": 3.879939556121826,
1305
+ "learning_rate": 1.6411042944785276e-07,
1306
+ "loss": 0.06901218891143798,
1307
+ "num_tokens": 765457.0,
1308
+ "reward": -0.2002499908208847,
1309
+ "reward_std": 0.19745510853827,
1310
+ "rewards/format_reward/mean": 0.20749999657273294,
1311
+ "rewards/format_reward/std": 0.2098293460905552,
1312
+ "rewards/security_audit_reward/mean": -0.375,
1313
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1314
+ "step": 220,
1315
+ "step_time": 36.50884771559977
1316
+ },
1317
+ {
1318
+ "clip_ratio/high_max": 0.0,
1319
+ "clip_ratio/high_mean": 0.0,
1320
+ "clip_ratio/low_mean": 0.0,
1321
+ "clip_ratio/low_min": 0.0,
1322
+ "clip_ratio/region_mean": 0.0,
1323
+ "completions/clipped_ratio": 0.35,
1324
+ "completions/max_length": 451.4,
1325
+ "completions/max_terminated_length": 281.8,
1326
+ "completions/mean_length": 310.85,
1327
+ "completions/mean_terminated_length": 210.58333435058594,
1328
+ "completions/min_length": 144.0,
1329
+ "completions/min_terminated_length": 144.0,
1330
+ "entropy": 1.4326449751853942,
1331
+ "epoch": 1.3803680981595092,
1332
+ "frac_reward_zero_std": 0.0,
1333
+ "grad_norm": 5.483170986175537,
1334
+ "learning_rate": 1.5644171779141104e-07,
1335
+ "loss": -0.03490494191646576,
1336
+ "num_tokens": 782226.0,
1337
+ "reward": -0.14649999141693115,
1338
+ "reward_std": 0.19791007936000823,
1339
+ "rewards/format_reward/mean": 0.3049999952316284,
1340
+ "rewards/format_reward/std": 0.19433450996875762,
1341
+ "rewards/security_audit_reward/mean": -0.3399999998509884,
1342
+ "rewards/security_audit_reward/std": 0.22000000029802322,
1343
+ "step": 225,
1344
+ "step_time": 35.079213985799655
1345
+ },
1346
+ {
1347
+ "clip_ratio/high_max": 0.0,
1348
+ "clip_ratio/high_mean": 0.0,
1349
+ "clip_ratio/low_mean": 0.0,
1350
+ "clip_ratio/low_min": 0.0,
1351
+ "clip_ratio/region_mean": 0.0,
1352
+ "completions/clipped_ratio": 0.35,
1353
+ "completions/max_length": 511.2,
1354
+ "completions/max_terminated_length": 315.2,
1355
+ "completions/mean_length": 338.15,
1356
+ "completions/mean_terminated_length": 226.28333435058593,
1357
+ "completions/min_length": 134.8,
1358
+ "completions/min_terminated_length": 134.8,
1359
+ "entropy": 1.1364098012447357,
1360
+ "epoch": 1.4110429447852761,
1361
+ "frac_reward_zero_std": 0.2,
1362
+ "grad_norm": 3.3578364849090576,
1363
+ "learning_rate": 1.4877300613496933e-07,
1364
+ "loss": 0.0896155834197998,
1365
+ "num_tokens": 798571.0,
1366
+ "reward": -0.11574998870491982,
1367
+ "reward_std": 0.19651760943233967,
1368
+ "rewards/format_reward/mean": 0.2674999989569187,
1369
+ "rewards/format_reward/std": 0.15046989992260934,
1370
+ "rewards/security_audit_reward/mean": -0.27999999821186067,
1371
+ "rewards/security_audit_reward/std": 0.2297215759754181,
1372
+ "step": 230,
1373
+ "step_time": 38.61798697480081
1374
+ },
1375
+ {
1376
+ "clip_ratio/high_max": 0.0,
1377
+ "clip_ratio/high_mean": 0.0,
1378
+ "clip_ratio/low_mean": 0.0,
1379
+ "clip_ratio/low_min": 0.0,
1380
+ "clip_ratio/region_mean": 0.0,
1381
+ "completions/clipped_ratio": 0.35,
1382
+ "completions/max_length": 505.4,
1383
+ "completions/max_terminated_length": 385.4,
1384
+ "completions/mean_length": 360.65,
1385
+ "completions/mean_terminated_length": 295.23333740234375,
1386
+ "completions/min_length": 210.2,
1387
+ "completions/min_terminated_length": 210.2,
1388
+ "entropy": 1.0565216183662414,
1389
+ "epoch": 1.441717791411043,
1390
+ "frac_reward_zero_std": 0.0,
1391
+ "grad_norm": 4.266097068786621,
1392
+ "learning_rate": 1.4110429447852758e-07,
1393
+ "loss": 0.07759050726890564,
1394
+ "num_tokens": 815570.0,
1395
+ "reward": -0.06749999299645423,
1396
+ "reward_std": 0.27374918162822726,
1397
+ "rewards/format_reward/mean": 0.37000001072883604,
1398
+ "rewards/format_reward/std": 0.1865294199436903,
1399
+ "rewards/security_audit_reward/mean": -0.2550000011920929,
1400
+ "rewards/security_audit_reward/std": 0.32802181243896483,
1401
+ "step": 235,
1402
+ "step_time": 38.44030983600023
1403
+ },
1404
+ {
1405
+ "clip_ratio/high_max": 0.0,
1406
+ "clip_ratio/high_mean": 0.0,
1407
+ "clip_ratio/low_mean": 0.0,
1408
+ "clip_ratio/low_min": 0.0,
1409
+ "clip_ratio/region_mean": 0.0,
1410
+ "completions/clipped_ratio": 0.3,
1411
+ "completions/max_length": 495.6,
1412
+ "completions/max_terminated_length": 352.6,
1413
+ "completions/mean_length": 336.05,
1414
+ "completions/mean_terminated_length": 243.60000915527343,
1415
+ "completions/min_length": 145.8,
1416
+ "completions/min_terminated_length": 145.8,
1417
+ "entropy": 1.417020809650421,
1418
+ "epoch": 1.4723926380368098,
1419
+ "frac_reward_zero_std": 0.2,
1420
+ "grad_norm": 4.767548084259033,
1421
+ "learning_rate": 1.334355828220859e-07,
1422
+ "loss": 0.03671485185623169,
1423
+ "num_tokens": 831713.0,
1424
+ "reward": -0.12849999219179153,
1425
+ "reward_std": 0.19381159394979477,
1426
+ "rewards/format_reward/mean": 0.3300000041723251,
1427
+ "rewards/format_reward/std": 0.20168980173766612,
1428
+ "rewards/security_audit_reward/mean": -0.325,
1429
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1430
+ "step": 240,
1431
+ "step_time": 37.49174958900003
1432
+ },
1433
+ {
1434
+ "clip_ratio/high_max": 0.0,
1435
+ "clip_ratio/high_mean": 0.0,
1436
+ "clip_ratio/low_mean": 0.0,
1437
+ "clip_ratio/low_min": 0.0,
1438
+ "clip_ratio/region_mean": 0.0,
1439
+ "completions/clipped_ratio": 0.6,
1440
+ "completions/max_length": 512.0,
1441
+ "completions/max_terminated_length": 343.0,
1442
+ "completions/mean_length": 417.3,
1443
+ "completions/mean_terminated_length": 293.7,
1444
+ "completions/min_length": 246.8,
1445
+ "completions/min_terminated_length": 246.8,
1446
+ "entropy": 1.0088598132133484,
1447
+ "epoch": 1.5030674846625767,
1448
+ "frac_reward_zero_std": 0.0,
1449
+ "grad_norm": 3.1936607360839844,
1450
+ "learning_rate": 1.2576687116564417e-07,
1451
+ "loss": -0.03240810632705689,
1452
+ "num_tokens": 849855.0,
1453
+ "reward": -0.11324999332427979,
1454
+ "reward_std": 0.2602782666683197,
1455
+ "rewards/format_reward/mean": 0.3224999994039536,
1456
+ "rewards/format_reward/std": 0.20757876634597777,
1457
+ "rewards/security_audit_reward/mean": -0.3,
1458
+ "rewards/security_audit_reward/std": 0.3154700517654419,
1459
+ "step": 245,
1460
+ "step_time": 39.07302968719996
1461
+ },
1462
+ {
1463
+ "clip_ratio/high_max": 0.0,
1464
+ "clip_ratio/high_mean": 0.0,
1465
+ "clip_ratio/low_mean": 0.0,
1466
+ "clip_ratio/low_min": 0.0,
1467
+ "clip_ratio/region_mean": 0.0,
1468
+ "completions/clipped_ratio": 0.2,
1469
+ "completions/max_length": 459.6,
1470
+ "completions/max_terminated_length": 428.2,
1471
+ "completions/mean_length": 325.35,
1472
+ "completions/mean_terminated_length": 277.5833374023438,
1473
+ "completions/min_length": 137.0,
1474
+ "completions/min_terminated_length": 137.0,
1475
+ "entropy": 1.1867628961801528,
1476
+ "epoch": 1.5337423312883436,
1477
+ "frac_reward_zero_std": 0.1,
1478
+ "grad_norm": 3.9449942111968994,
1479
+ "learning_rate": 1.1809815950920244e-07,
1480
+ "loss": -0.005416367202997208,
1481
+ "num_tokens": 866330.0,
1482
+ "reward": -0.08199999034404755,
1483
+ "reward_std": 0.2747137784957886,
1484
+ "rewards/format_reward/mean": 0.3099999874830246,
1485
+ "rewards/format_reward/std": 0.17935641929507257,
1486
+ "rewards/security_audit_reward/mean": -0.25,
1487
+ "rewards/security_audit_reward/std": 0.33094010353088377,
1488
+ "step": 250,
1489
+ "step_time": 35.27377968320034
1490
+ },
1491
+ {
1492
+ "clip_ratio/high_max": 0.0,
1493
+ "clip_ratio/high_mean": 0.0,
1494
+ "clip_ratio/low_mean": 0.0,
1495
+ "clip_ratio/low_min": 0.0,
1496
+ "clip_ratio/region_mean": 0.0,
1497
+ "completions/clipped_ratio": 0.25,
1498
+ "completions/max_length": 450.0,
1499
+ "completions/max_terminated_length": 311.2,
1500
+ "completions/mean_length": 283.8,
1501
+ "completions/mean_terminated_length": 202.06666870117186,
1502
+ "completions/min_length": 128.0,
1503
+ "completions/min_terminated_length": 128.0,
1504
+ "entropy": 1.322118791937828,
1505
+ "epoch": 1.5644171779141103,
1506
+ "frac_reward_zero_std": 0.1,
1507
+ "grad_norm": 6.9948039054870605,
1508
+ "learning_rate": 1.1042944785276073e-07,
1509
+ "loss": 0.055895209312438965,
1510
+ "num_tokens": 881650.0,
1511
+ "reward": -0.11924999132752419,
1512
+ "reward_std": 0.14510822538286447,
1513
+ "rewards/format_reward/mean": 0.3024999976158142,
1514
+ "rewards/format_reward/std": 0.15741010159254074,
1515
+ "rewards/security_audit_reward/mean": -0.3,
1516
+ "rewards/security_audit_reward/std": 0.15773502588272095,
1517
+ "step": 255,
1518
+ "step_time": 34.332786842800125
1519
+ },
1520
+ {
1521
+ "clip_ratio/high_max": 0.0,
1522
+ "clip_ratio/high_mean": 0.0,
1523
+ "clip_ratio/low_mean": 0.0,
1524
+ "clip_ratio/low_min": 0.0,
1525
+ "clip_ratio/region_mean": 0.0,
1526
+ "completions/clipped_ratio": 0.3,
1527
+ "completions/max_length": 485.0,
1528
+ "completions/max_terminated_length": 312.2,
1529
+ "completions/mean_length": 312.0,
1530
+ "completions/mean_terminated_length": 225.13333740234376,
1531
+ "completions/min_length": 141.8,
1532
+ "completions/min_terminated_length": 141.8,
1533
+ "entropy": 1.1261488378047944,
1534
+ "epoch": 1.5950920245398774,
1535
+ "frac_reward_zero_std": 0.1,
1536
+ "grad_norm": 5.633542537689209,
1537
+ "learning_rate": 1.0276073619631902e-07,
1538
+ "loss": 0.12422184944152832,
1539
+ "num_tokens": 898170.0,
1540
+ "reward": -0.09424999356269836,
1541
+ "reward_std": 0.2452640563249588,
1542
+ "rewards/format_reward/mean": 0.3274999916553497,
1543
+ "rewards/format_reward/std": 0.19424656331539153,
1544
+ "rewards/security_audit_reward/mean": -0.275,
1545
+ "rewards/security_audit_reward/std": 0.29574271440505984,
1546
+ "step": 260,
1547
+ "step_time": 37.18088837539908
1548
+ },
1549
+ {
1550
+ "clip_ratio/high_max": 0.0,
1551
+ "clip_ratio/high_mean": 0.0,
1552
+ "clip_ratio/low_mean": 0.0,
1553
+ "clip_ratio/low_min": 0.0,
1554
+ "clip_ratio/region_mean": 0.0,
1555
+ "completions/clipped_ratio": 0.3,
1556
+ "completions/max_length": 481.4,
1557
+ "completions/max_terminated_length": 333.8,
1558
+ "completions/mean_length": 323.4,
1559
+ "completions/mean_terminated_length": 233.95,
1560
+ "completions/min_length": 174.6,
1561
+ "completions/min_terminated_length": 174.6,
1562
+ "entropy": 1.3164357602596284,
1563
+ "epoch": 1.6257668711656441,
1564
+ "frac_reward_zero_std": 0.2,
1565
+ "grad_norm": 4.177126884460449,
1566
+ "learning_rate": 9.50920245398773e-08,
1567
+ "loss": -0.0031075358390808107,
1568
+ "num_tokens": 914438.0,
1569
+ "reward": -0.109499990940094,
1570
+ "reward_std": 0.19355954378843307,
1571
+ "rewards/format_reward/mean": 0.3700000077486038,
1572
+ "rewards/format_reward/std": 0.19431518614292145,
1573
+ "rewards/security_audit_reward/mean": -0.31500000059604644,
1574
+ "rewards/security_audit_reward/std": 0.21745660305023193,
1575
+ "step": 265,
1576
+ "step_time": 36.75480746599969
1577
+ },
1578
+ {
1579
+ "clip_ratio/high_max": 0.0,
1580
+ "clip_ratio/high_mean": 0.0,
1581
+ "clip_ratio/low_mean": 0.0,
1582
+ "clip_ratio/low_min": 0.0,
1583
+ "clip_ratio/region_mean": 0.0,
1584
+ "completions/clipped_ratio": 0.15,
1585
+ "completions/max_length": 460.2,
1586
+ "completions/max_terminated_length": 279.2,
1587
+ "completions/mean_length": 224.9,
1588
+ "completions/mean_terminated_length": 170.45000305175782,
1589
+ "completions/min_length": 84.4,
1590
+ "completions/min_terminated_length": 84.4,
1591
+ "entropy": 1.267154586315155,
1592
+ "epoch": 1.656441717791411,
1593
+ "frac_reward_zero_std": 0.2,
1594
+ "grad_norm": 7.552680015563965,
1595
+ "learning_rate": 8.742331288343557e-08,
1596
+ "loss": -0.12154214382171631,
1597
+ "num_tokens": 928536.0,
1598
+ "reward": -0.07999998778104782,
1599
+ "reward_std": 0.16925212144851684,
1600
+ "rewards/format_reward/mean": 0.37499999403953554,
1601
+ "rewards/format_reward/std": 0.1521439790725708,
1602
+ "rewards/security_audit_reward/mean": -0.275,
1603
+ "rewards/security_audit_reward/std": 0.20773502588272094,
1604
+ "step": 270,
1605
+ "step_time": 35.04094967719975
1606
+ },
1607
+ {
1608
+ "clip_ratio/high_max": 0.0,
1609
+ "clip_ratio/high_mean": 0.0,
1610
+ "clip_ratio/low_mean": 0.0,
1611
+ "clip_ratio/low_min": 0.0,
1612
+ "clip_ratio/region_mean": 0.0,
1613
+ "completions/clipped_ratio": 0.3,
1614
+ "completions/max_length": 469.4,
1615
+ "completions/max_terminated_length": 287.6,
1616
+ "completions/mean_length": 300.95,
1617
+ "completions/mean_terminated_length": 204.76667175292968,
1618
+ "completions/min_length": 130.0,
1619
+ "completions/min_terminated_length": 130.0,
1620
+ "entropy": 1.1936017721891403,
1621
+ "epoch": 1.687116564417178,
1622
+ "frac_reward_zero_std": 0.2,
1623
+ "grad_norm": 7.106525421142578,
1624
+ "learning_rate": 7.975460122699386e-08,
1625
+ "loss": -0.0458857923746109,
1626
+ "num_tokens": 944307.0,
1627
+ "reward": -0.05174999088048935,
1628
+ "reward_std": 0.23333178758621215,
1629
+ "rewards/format_reward/mean": 0.3874999940395355,
1630
+ "rewards/format_reward/std": 0.16559004038572311,
1631
+ "rewards/security_audit_reward/mean": -0.24000000059604645,
1632
+ "rewards/security_audit_reward/std": 0.2866505742073059,
1633
+ "step": 275,
1634
+ "step_time": 36.0470411268001
1635
+ },
1636
+ {
1637
+ "clip_ratio/high_max": 0.0,
1638
+ "clip_ratio/high_mean": 0.0,
1639
+ "clip_ratio/low_mean": 0.0,
1640
+ "clip_ratio/low_min": 0.0,
1641
+ "clip_ratio/region_mean": 0.0,
1642
+ "completions/clipped_ratio": 0.2,
1643
+ "completions/max_length": 498.6,
1644
+ "completions/max_terminated_length": 375.0,
1645
+ "completions/mean_length": 295.25,
1646
+ "completions/mean_terminated_length": 232.6666717529297,
1647
+ "completions/min_length": 97.8,
1648
+ "completions/min_terminated_length": 97.8,
1649
+ "entropy": 1.460896384716034,
1650
+ "epoch": 1.7177914110429446,
1651
+ "frac_reward_zero_std": 0.0,
1652
+ "grad_norm": 5.004129409790039,
1653
+ "learning_rate": 7.208588957055214e-08,
1654
+ "loss": -0.10658804178237916,
1655
+ "num_tokens": 960078.0,
1656
+ "reward": -0.01824999153614044,
1657
+ "reward_std": 0.2521414369344711,
1658
+ "rewards/format_reward/mean": 0.3825000047683716,
1659
+ "rewards/format_reward/std": 0.15327396541833876,
1660
+ "rewards/security_audit_reward/mean": -0.1899999976158142,
1661
+ "rewards/security_audit_reward/std": 0.3234777390956879,
1662
+ "step": 280,
1663
+ "step_time": 37.26412723539943
1664
+ },
1665
+ {
1666
+ "clip_ratio/high_max": 0.0,
1667
+ "clip_ratio/high_mean": 0.0,
1668
+ "clip_ratio/low_mean": 0.0,
1669
+ "clip_ratio/low_min": 0.0,
1670
+ "clip_ratio/region_mean": 0.0,
1671
+ "completions/clipped_ratio": 0.3,
1672
+ "completions/max_length": 501.0,
1673
+ "completions/max_terminated_length": 363.2,
1674
+ "completions/mean_length": 340.6,
1675
+ "completions/mean_terminated_length": 257.06666870117186,
1676
+ "completions/min_length": 146.2,
1677
+ "completions/min_terminated_length": 146.2,
1678
+ "entropy": 1.0431257128715514,
1679
+ "epoch": 1.7484662576687118,
1680
+ "frac_reward_zero_std": 0.1,
1681
+ "grad_norm": 4.059199333190918,
1682
+ "learning_rate": 6.441717791411043e-08,
1683
+ "loss": -0.10386581420898437,
1684
+ "num_tokens": 976992.0,
1685
+ "reward": -0.06924999207258224,
1686
+ "reward_std": 0.2484972782433033,
1687
+ "rewards/format_reward/mean": 0.38750000596046447,
1688
+ "rewards/format_reward/std": 0.1250488668680191,
1689
+ "rewards/security_audit_reward/mean": -0.26500000059604645,
1690
+ "rewards/security_audit_reward/std": 0.3080150008201599,
1691
+ "step": 285,
1692
+ "step_time": 38.17630281240017
1693
+ },
1694
+ {
1695
+ "clip_ratio/high_max": 0.0,
1696
+ "clip_ratio/high_mean": 0.0,
1697
+ "clip_ratio/low_mean": 0.0,
1698
+ "clip_ratio/low_min": 0.0,
1699
+ "clip_ratio/region_mean": 0.0,
1700
+ "completions/clipped_ratio": 0.05,
1701
+ "completions/max_length": 435.0,
1702
+ "completions/max_terminated_length": 413.6,
1703
+ "completions/mean_length": 281.65,
1704
+ "completions/mean_terminated_length": 274.0,
1705
+ "completions/min_length": 152.0,
1706
+ "completions/min_terminated_length": 152.0,
1707
+ "entropy": 1.1775987446308136,
1708
+ "epoch": 1.7791411042944785,
1709
+ "frac_reward_zero_std": 0.3,
1710
+ "grad_norm": 4.12350606918335,
1711
+ "learning_rate": 5.674846625766871e-08,
1712
+ "loss": -0.03856886327266693,
1713
+ "num_tokens": 992819.0,
1714
+ "reward": -0.05999999046325684,
1715
+ "reward_std": 0.1456713281571865,
1716
+ "rewards/format_reward/mean": 0.360000005364418,
1717
+ "rewards/format_reward/std": 0.12044776938855647,
1718
+ "rewards/security_audit_reward/mean": -0.23999999985098838,
1719
+ "rewards/security_audit_reward/std": 0.17773502618074416,
1720
+ "step": 290,
1721
+ "step_time": 33.279480656001034
1722
+ },
1723
+ {
1724
+ "clip_ratio/high_max": 0.0,
1725
+ "clip_ratio/high_mean": 0.0,
1726
+ "clip_ratio/low_mean": 0.0,
1727
+ "clip_ratio/low_min": 0.0,
1728
+ "clip_ratio/region_mean": 0.0,
1729
+ "completions/clipped_ratio": 0.15,
1730
+ "completions/max_length": 470.0,
1731
+ "completions/max_terminated_length": 383.0,
1732
+ "completions/mean_length": 306.15,
1733
+ "completions/mean_terminated_length": 264.66666717529296,
1734
+ "completions/min_length": 170.8,
1735
+ "completions/min_terminated_length": 170.8,
1736
+ "entropy": 1.3437508165836334,
1737
+ "epoch": 1.8098159509202454,
1738
+ "frac_reward_zero_std": 0.0,
1739
+ "grad_norm": 5.014013290405273,
1740
+ "learning_rate": 4.907975460122699e-08,
1741
+ "loss": 0.1800641179084778,
1742
+ "num_tokens": 1008676.0,
1743
+ "reward": -0.14524998962879182,
1744
+ "reward_std": 0.19439554661512376,
1745
+ "rewards/format_reward/mean": 0.33249999284744264,
1746
+ "rewards/format_reward/std": 0.21220951080322265,
1747
+ "rewards/security_audit_reward/mean": -0.35,
1748
+ "rewards/security_audit_reward/std": 0.2154700517654419,
1749
+ "step": 295,
1750
+ "step_time": 35.92737517459973
1751
+ },
1752
+ {
1753
+ "clip_ratio/high_max": 0.0,
1754
+ "clip_ratio/high_mean": 0.0,
1755
+ "clip_ratio/low_mean": 0.0,
1756
+ "clip_ratio/low_min": 0.0,
1757
+ "clip_ratio/region_mean": 0.0,
1758
+ "completions/clipped_ratio": 0.3,
1759
+ "completions/max_length": 471.6,
1760
+ "completions/max_terminated_length": 299.6,
1761
+ "completions/mean_length": 360.45,
1762
+ "completions/mean_terminated_length": 234.9166687011719,
1763
+ "completions/min_length": 229.0,
1764
+ "completions/min_terminated_length": 126.6,
1765
+ "entropy": 1.1840724140405654,
1766
+ "epoch": 1.8404907975460123,
1767
+ "frac_reward_zero_std": 0.1,
1768
+ "grad_norm": 2.570652484893799,
1769
+ "learning_rate": 4.1411042944785274e-08,
1770
+ "loss": 0.019638296961784363,
1771
+ "num_tokens": 1025285.0,
1772
+ "reward": -0.06699999049305916,
1773
+ "reward_std": 0.2400740846991539,
1774
+ "rewards/format_reward/mean": 0.3600000023841858,
1775
+ "rewards/format_reward/std": 0.1531308189034462,
1776
+ "rewards/security_audit_reward/mean": -0.25,
1777
+ "rewards/security_audit_reward/std": 0.2868344783782959,
1778
+ "step": 300,
1779
+ "step_time": 35.41554451999982
1780
+ },
1781
+ {
1782
+ "clip_ratio/high_max": 0.0,
1783
+ "clip_ratio/high_mean": 0.0,
1784
+ "clip_ratio/low_mean": 0.0,
1785
+ "clip_ratio/low_min": 0.0,
1786
+ "clip_ratio/region_mean": 0.0,
1787
+ "completions/clipped_ratio": 0.3,
1788
+ "completions/max_length": 491.0,
1789
+ "completions/max_terminated_length": 357.4,
1790
+ "completions/mean_length": 330.65,
1791
+ "completions/mean_terminated_length": 249.31666870117186,
1792
+ "completions/min_length": 125.0,
1793
+ "completions/min_terminated_length": 125.0,
1794
+ "entropy": 1.2406673014163971,
1795
+ "epoch": 1.871165644171779,
1796
+ "frac_reward_zero_std": 0.0,
1797
+ "grad_norm": 4.481863975524902,
1798
+ "learning_rate": 3.3742331288343556e-08,
1799
+ "loss": 0.2060640573501587,
1800
+ "num_tokens": 1041362.0,
1801
+ "reward": -0.005999994277954101,
1802
+ "reward_std": 0.23084985613822936,
1803
+ "rewards/format_reward/mean": 0.4000000059604645,
1804
+ "rewards/format_reward/std": 0.12440616972744464,
1805
+ "rewards/security_audit_reward/mean": -0.1800000011920929,
1806
+ "rewards/security_audit_reward/std": 0.2963721513748169,
1807
+ "step": 305,
1808
+ "step_time": 37.082924159199685
1809
+ },
1810
+ {
1811
+ "clip_ratio/high_max": 0.0,
1812
+ "clip_ratio/high_mean": 0.0,
1813
+ "clip_ratio/low_mean": 0.0,
1814
+ "clip_ratio/low_min": 0.0,
1815
+ "clip_ratio/region_mean": 0.0,
1816
+ "completions/clipped_ratio": 0.25,
1817
+ "completions/max_length": 440.8,
1818
+ "completions/max_terminated_length": 313.2,
1819
+ "completions/mean_length": 296.75,
1820
+ "completions/mean_terminated_length": 235.5666748046875,
1821
+ "completions/min_length": 162.4,
1822
+ "completions/min_terminated_length": 162.4,
1823
+ "entropy": 1.4028007209300994,
1824
+ "epoch": 1.9018404907975461,
1825
+ "frac_reward_zero_std": 0.1,
1826
+ "grad_norm": 3.5625224113464355,
1827
+ "learning_rate": 2.607361963190184e-08,
1828
+ "loss": -0.09365988969802856,
1829
+ "num_tokens": 1056493.0,
1830
+ "reward": -0.07374998778104783,
1831
+ "reward_std": 0.17730526700615884,
1832
+ "rewards/format_reward/mean": 0.3725000023841858,
1833
+ "rewards/format_reward/std": 0.1375160299241543,
1834
+ "rewards/security_audit_reward/mean": -0.26500000059604645,
1835
+ "rewards/security_audit_reward/std": 0.21745660305023193,
1836
+ "step": 310,
1837
+ "step_time": 32.90120237959964
1838
+ },
1839
+ {
1840
+ "clip_ratio/high_max": 0.0,
1841
+ "clip_ratio/high_mean": 0.0,
1842
+ "clip_ratio/low_mean": 0.0,
1843
+ "clip_ratio/low_min": 0.0,
1844
+ "clip_ratio/region_mean": 0.0,
1845
+ "completions/clipped_ratio": 0.35,
1846
+ "completions/max_length": 508.8,
1847
+ "completions/max_terminated_length": 354.0,
1848
+ "completions/mean_length": 356.8,
1849
+ "completions/mean_terminated_length": 257.8333343505859,
1850
+ "completions/min_length": 172.4,
1851
+ "completions/min_terminated_length": 172.4,
1852
+ "entropy": 1.2189550220966339,
1853
+ "epoch": 1.9325153374233128,
1854
+ "frac_reward_zero_std": 0.2,
1855
+ "grad_norm": 4.627664089202881,
1856
+ "learning_rate": 1.8404907975460124e-08,
1857
+ "loss": -0.043705222010612485,
1858
+ "num_tokens": 1073209.0,
1859
+ "reward": -0.10774998962879181,
1860
+ "reward_std": 0.1972955085337162,
1861
+ "rewards/format_reward/mean": 0.35250000059604647,
1862
+ "rewards/format_reward/std": 0.13575982302427292,
1863
+ "rewards/security_audit_reward/mean": -0.3050000011920929,
1864
+ "rewards/security_audit_reward/std": 0.25009607076644896,
1865
+ "step": 315,
1866
+ "step_time": 38.71515165839992
1867
+ },
1868
+ {
1869
+ "clip_ratio/high_max": 0.0,
1870
+ "clip_ratio/high_mean": 0.0,
1871
+ "clip_ratio/low_mean": 0.0,
1872
+ "clip_ratio/low_min": 0.0,
1873
+ "clip_ratio/region_mean": 0.0,
1874
+ "completions/clipped_ratio": 0.3,
1875
+ "completions/max_length": 512.0,
1876
+ "completions/max_terminated_length": 394.2,
1877
+ "completions/mean_length": 341.55,
1878
+ "completions/mean_terminated_length": 271.6666717529297,
1879
+ "completions/min_length": 180.4,
1880
+ "completions/min_terminated_length": 180.4,
1881
+ "entropy": 1.1277358770370483,
1882
+ "epoch": 1.9631901840490797,
1883
+ "frac_reward_zero_std": 0.3,
1884
+ "grad_norm": 3.979893207550049,
1885
+ "learning_rate": 1.0736196319018405e-08,
1886
+ "loss": -0.07816079258918762,
1887
+ "num_tokens": 1089918.0,
1888
+ "reward": -0.08449999019503593,
1889
+ "reward_std": 0.14596682507544756,
1890
+ "rewards/format_reward/mean": 0.3949999988079071,
1891
+ "rewards/format_reward/std": 0.13996364884078502,
1892
+ "rewards/security_audit_reward/mean": -0.29000000059604647,
1893
+ "rewards/security_audit_reward/std": 0.17118052244186402,
1894
+ "step": 320,
1895
+ "step_time": 39.41049809280048
1896
+ },
1897
+ {
1898
+ "clip_ratio/high_max": 0.0,
1899
+ "clip_ratio/high_mean": 0.0,
1900
+ "clip_ratio/low_mean": 0.0,
1901
+ "clip_ratio/low_min": 0.0,
1902
+ "clip_ratio/region_mean": 0.0,
1903
+ "completions/clipped_ratio": 0.15,
1904
+ "completions/max_length": 488.4,
1905
+ "completions/max_terminated_length": 420.6,
1906
+ "completions/mean_length": 319.75,
1907
+ "completions/mean_terminated_length": 284.56667175292966,
1908
+ "completions/min_length": 199.2,
1909
+ "completions/min_terminated_length": 199.2,
1910
+ "entropy": 1.3403348803520203,
1911
+ "epoch": 1.9938650306748467,
1912
+ "frac_reward_zero_std": 0.2,
1913
+ "grad_norm": 3.1303930282592773,
1914
+ "learning_rate": 3.067484662576687e-09,
1915
+ "loss": -0.08682631254196167,
1916
+ "num_tokens": 1105841.0,
1917
+ "reward": -0.05374999940395355,
1918
+ "reward_std": 0.21490582572296263,
1919
+ "rewards/format_reward/mean": 0.2875,
1920
+ "rewards/format_reward/std": 0.19336618185043336,
1921
+ "rewards/security_audit_reward/mean": -0.2,
1922
+ "rewards/security_audit_reward/std": 0.22739237546920776,
1923
+ "step": 325,
1924
+ "step_time": 37.026589032400445
1925
+ }
1926
+ ],
1927
+ "logging_steps": 5,
1928
+ "max_steps": 326,
1929
+ "num_input_tokens_seen": 1108991,
1930
+ "num_train_epochs": 2,
1931
+ "save_steps": 50,
1932
+ "stateful_callbacks": {
1933
+ "TrainerControl": {
1934
+ "args": {
1935
+ "should_epoch_stop": false,
1936
+ "should_evaluate": false,
1937
+ "should_log": false,
1938
+ "should_save": true,
1939
+ "should_training_stop": true
1940
+ },
1941
+ "attributes": {}
1942
+ }
1943
+ },
1944
+ "total_flos": 0.0,
1945
+ "train_batch_size": 2,
1946
+ "trial_name": null,
1947
+ "trial_params": null
1948
+ }
checkpoint-326/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b51f3856815802830b7add9e23ddd089207e5c9941078dd606f120af0f983d09
3
+ size 6776
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:da48f2616286a0c7d4ceeeadcd3221c2d7381581ab8ec32e6ad58d13b0f6629a
2
+ oid sha256:3c64421ad1b2b08b8f84687a636657a68a9f6c9ef639c6c2dc449cb93d2c4219
3
  size 1976163472
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b51f3856815802830b7add9e23ddd089207e5c9941078dd606f120af0f983d09
3
+ size 6776