Lansechen committed · verified
Commit: 1eaeb3a · Parent(s): 4c6ab0b

Model save
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ base_model: Qwen/Qwen2.5-7B
+ library_name: transformers
+ model_name: Qwen2.5-7B-Open-R1-GRPO-math-lighteval-log
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ license: license
+ ---
+
+ # Model Card for Qwen2.5-7B-Open-R1-GRPO-math-lighteval-log
+
+ This model is a fine-tuned version of [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="Lansechen/Qwen2.5-7B-Open-R1-GRPO-math-lighteval-log", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/chenran1995-the-chinese-university-of-hong-kong/huggingface/runs/idxxkq28)
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
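GRPO removes the learned value model used by PPO: for each prompt it samples a group of completions, scores them with reward functions (here accuracy, format, and log-scaled rewards, per the metrics in `trainer_state.json`), and uses each completion's group-normalized reward as its advantage. A minimal sketch of that normalization, illustrative only and not the TRL implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: each completion's reward standardized
    against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four completions sampled from one math prompt:
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

Above-average completions get positive advantages and below-average ones negative, so no separate critic network is needed.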
+
+ ### Framework versions
+
+ - TRL: 0.16.0
+ - Transformers: 4.49.0
+ - PyTorch: 2.5.1+cu121
+ - Datasets: 3.3.1
+ - Tokenizers: 0.21.0
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+     title  = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+     author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+     year   = 2024,
+     eprint = {arXiv:2402.03300},
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+     title        = {{TRL: Transformer Reinforcement Learning}},
+     author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+     year         = 2020,
+     journal      = {GitHub repository},
+     publisher    = {GitHub},
+     howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 0.0,
+     "train_loss": 0.04579740530018937,
+     "train_runtime": 16666.94,
+     "train_samples": 7500,
+     "train_samples_per_second": 0.9,
+     "train_steps_per_second": 0.008
+ }
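These numbers are internally consistent: runtime × samples/second recovers the total samples processed (about 15,000, i.e. the 7,500 training samples seen over roughly two epochs), and samples/second ÷ steps/second gives the effective batch size per optimizer step. A quick sanity check using only the values above (the logged rates are rounded, so the figures are approximate):

```python
stats = {
    "train_runtime": 16666.94,            # seconds
    "train_samples": 7500,
    "train_samples_per_second": 0.9,
    "train_steps_per_second": 0.008,
}

# Total samples processed over the whole run.
samples_seen = stats["train_runtime"] * stats["train_samples_per_second"]

# Passes over the 7,500-sample training set.
epochs = samples_seen / stats["train_samples"]

# Samples consumed per optimizer step (effective batch size, ~112).
samples_per_step = (stats["train_samples_per_second"]
                    / stats["train_steps_per_second"])
```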
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+     "bos_token_id": 151643,
+     "eos_token_id": 151643,
+     "max_new_tokens": 2048,
+     "transformers_version": "4.49.0"
+ }
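Note that `bos_token_id` and `eos_token_id` are the same token here (151643, `<|endoftext|>` in the Qwen2.5 base vocabulary), and generation is capped at 2048 new tokens. Reading the file as plain JSON, a minimal sketch (in practice `GenerationConfig.from_pretrained` in transformers handles this):

```python
import json

raw = """{
    "bos_token_id": 151643,
    "eos_token_id": 151643,
    "max_new_tokens": 2048,
    "transformers_version": "4.49.0"
}"""
cfg = json.loads(raw)

# The base model reuses one special token for both sequence start and end,
# so generation stops on that token or at the max_new_tokens cap.
single_special_token = cfg["bos_token_id"] == cfg["eos_token_id"]
```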
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 0.0,
+     "train_loss": 0.04579740530018937,
+     "train_runtime": 16666.94,
+     "train_samples": 7500,
+     "train_samples_per_second": 0.9,
+     "train_steps_per_second": 0.008
+ }
trainer_state.json ADDED
@@ -0,0 +1,2030 @@
+ {
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 1.9850746268656716,
+   "eval_steps": 100,
+   "global_step": 132,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 823.762321472168,
+       "epoch": 0.014925373134328358,
+       "grad_norm": 0.17830021679401398,
+       "learning_rate": 7.142857142857142e-08,
+       "loss": 0.0919,
+       "num_tokens": 865267.0,
+       "reward": 2.096768334507942,
+       "reward_std": 0.32721264474093914,
+       "rewards/accuracy_reward": 0.7645089328289032,
+       "rewards/format_reward": 0.9899553433060646,
+       "rewards/log_scaled_reward": 0.3423039447516203,
+       "step": 1
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 465.7957763671875,
+       "epoch": 0.029850746268656716,
+       "grad_norm": 0.5716150403022766,
+       "learning_rate": 1.4285714285714285e-07,
+       "loss": 0.1518,
+       "num_tokens": 1414092.0,
+       "reward": 0.1841081934981048,
+       "reward_std": 0.6422095634043217,
+       "rewards/accuracy_reward": 0.23437499813735485,
+       "rewards/format_reward": 0.05022321571595967,
+       "rewards/log_scaled_reward": -0.10049002850428224,
+       "step": 2
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 482.07144927978516,
+       "epoch": 0.04477611940298507,
+       "grad_norm": 0.532157301902771,
+       "learning_rate": 2.1428571428571426e-07,
+       "loss": 0.1335,
+       "num_tokens": 1992932.0,
+       "reward": 0.10916334297508001,
+       "reward_std": 0.6189606413245201,
+       "rewards/accuracy_reward": 0.20647321455180645,
+       "rewards/format_reward": 0.053571428754366934,
+       "rewards/log_scaled_reward": -0.1508813016116619,
+       "step": 3
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 529.5971221923828,
+       "epoch": 0.05970149253731343,
+       "grad_norm": 0.42384153604507446,
+       "learning_rate": 2.857142857142857e-07,
+       "loss": 0.2014,
+       "num_tokens": 2596963.0,
+       "reward": 0.05200400925241411,
+       "reward_std": 0.582377951592207,
+       "rewards/accuracy_reward": 0.18191964272409678,
+       "rewards/format_reward": 0.04464285809081048,
+       "rewards/log_scaled_reward": -0.17455849051475525,
+       "step": 4
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 490.2690010070801,
+       "epoch": 0.07462686567164178,
+       "grad_norm": 0.42382651567459106,
+       "learning_rate": 3.5714285714285716e-07,
+       "loss": 0.1441,
+       "num_tokens": 3173748.0,
+       "reward": 0.10420671524479985,
+       "reward_std": 0.5982032977044582,
+       "rewards/accuracy_reward": 0.18750000279396772,
+       "rewards/format_reward": 0.07366071455180645,
+       "rewards/log_scaled_reward": -0.15695400722324848,
+       "step": 5
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 450.69086837768555,
+       "epoch": 0.08955223880597014,
+       "grad_norm": 0.5930183529853821,
+       "learning_rate": 4.285714285714285e-07,
+       "loss": 0.12,
+       "num_tokens": 3702055.0,
+       "reward": 0.16118104895576835,
+       "reward_std": 0.6403544321656227,
+       "rewards/accuracy_reward": 0.20982143096625805,
+       "rewards/format_reward": 0.07142857182770967,
+       "rewards/log_scaled_reward": -0.12006895546801388,
+       "step": 6
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 469.174129486084,
+       "epoch": 0.1044776119402985,
+       "grad_norm": 0.5980175137519836,
+       "learning_rate": 5e-07,
+       "loss": 0.1409,
+       "num_tokens": 4251483.0,
+       "reward": 0.18134657479822636,
+       "reward_std": 0.680236354470253,
+       "rewards/accuracy_reward": 0.20982142724096775,
+       "rewards/format_reward": 0.09821428637951612,
+       "rewards/log_scaled_reward": -0.12668914068490267,
+       "step": 7
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 501.63505935668945,
+       "epoch": 0.11940298507462686,
+       "grad_norm": 0.7299549579620361,
+       "learning_rate": 5.714285714285714e-07,
+       "loss": 0.1195,
+       "num_tokens": 4818596.0,
+       "reward": 0.20046980120241642,
+       "reward_std": 0.67672199010849,
+       "rewards/accuracy_reward": 0.2120535708963871,
+       "rewards/format_reward": 0.1183035708963871,
+       "rewards/log_scaled_reward": -0.1298873471096158,
+       "step": 8
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 495.338191986084,
+       "epoch": 0.13432835820895522,
+       "grad_norm": 0.4346173405647278,
+       "learning_rate": 6.428571428571429e-07,
+       "loss": 0.1516,
+       "num_tokens": 5387755.0,
+       "reward": 0.2097643855959177,
+       "reward_std": 0.6800287291407585,
+       "rewards/accuracy_reward": 0.2220982164144516,
+       "rewards/format_reward": 0.10937500093132257,
+       "rewards/log_scaled_reward": -0.12170884059742093,
+       "step": 9
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 477.33930587768555,
+       "epoch": 0.14925373134328357,
+       "grad_norm": 3.780118942260742,
+       "learning_rate": 7.142857142857143e-07,
+       "loss": 0.0884,
+       "num_tokens": 5940091.0,
+       "reward": 0.31957440078258514,
+       "reward_std": 0.7821567356586456,
+       "rewards/accuracy_reward": 0.2500000037252903,
+       "rewards/format_reward": 0.16183035727590322,
+       "rewards/log_scaled_reward": -0.092255964060314,
+       "step": 10
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 529.9196739196777,
+       "epoch": 0.16417910447761194,
+       "grad_norm": 0.7230793833732605,
+       "learning_rate": 7.857142857142856e-07,
+       "loss": 0.0843,
+       "num_tokens": 6544955.0,
+       "reward": 0.2673153153154999,
+       "reward_std": 0.7341821119189262,
+       "rewards/accuracy_reward": 0.21316964086145163,
+       "rewards/format_reward": 0.1897321417927742,
+       "rewards/log_scaled_reward": -0.13558648666366935,
+       "step": 11
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 498.72323989868164,
+       "epoch": 0.1791044776119403,
+       "grad_norm": 0.9150818586349487,
+       "learning_rate": 8.57142857142857e-07,
+       "loss": 0.1105,
+       "num_tokens": 7121891.0,
+       "reward": 0.34613738395273685,
+       "reward_std": 0.7445657253265381,
+       "rewards/accuracy_reward": 0.2388392873108387,
+       "rewards/format_reward": 0.22544642630964518,
+       "rewards/log_scaled_reward": -0.11814834456890821,
+       "step": 12
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 483.24221420288086,
+       "epoch": 0.19402985074626866,
+       "grad_norm": 2.0404834747314453,
+       "learning_rate": 9.285714285714285e-07,
+       "loss": 0.0535,
+       "num_tokens": 7678060.0,
+       "reward": 0.5936555862426758,
+       "reward_std": 0.7919039279222488,
+       "rewards/accuracy_reward": 0.2667410708963871,
+       "rewards/format_reward": 0.4084821417927742,
+       "rewards/log_scaled_reward": -0.08156766439788043,
+       "step": 13
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 416.2042541503906,
+       "epoch": 0.208955223880597,
+       "grad_norm": 3.1881885528564453,
+       "learning_rate": 1e-06,
+       "loss": 0.0238,
+       "num_tokens": 8174851.0,
+       "reward": 0.6864497661590576,
+       "reward_std": 0.8334432542324066,
+       "rewards/accuracy_reward": 0.2354910708963871,
+       "rewards/format_reward": 0.5345982126891613,
+       "rewards/log_scaled_reward": -0.0836395358783193,
+       "step": 14
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 402.8460006713867,
+       "epoch": 0.22388059701492538,
+       "grad_norm": 1.9057585000991821,
+       "learning_rate": 9.998286624877785e-07,
+       "loss": 0.0362,
+       "num_tokens": 8650305.0,
+       "reward": 0.7661240547895432,
+       "reward_std": 0.8178009614348412,
+       "rewards/accuracy_reward": 0.22656249813735485,
+       "rewards/format_reward": 0.6272321343421936,
+       "rewards/log_scaled_reward": -0.08767060440732166,
+       "step": 15
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 452.4196662902832,
+       "epoch": 0.23880597014925373,
+       "grad_norm": 25.345657348632812,
+       "learning_rate": 9.99314767377287e-07,
+       "loss": 0.0285,
+       "num_tokens": 9179041.0,
+       "reward": 0.875191293656826,
+       "reward_std": 0.7929042428731918,
+       "rewards/accuracy_reward": 0.25781249813735485,
+       "rewards/format_reward": 0.6964285671710968,
+       "rewards/log_scaled_reward": -0.07904981379397213,
+       "step": 16
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 416.35605239868164,
+       "epoch": 0.2537313432835821,
+       "grad_norm": 1.1184340715408325,
+       "learning_rate": 9.98458666866564e-07,
+       "loss": 0.0563,
+       "num_tokens": 9701832.0,
+       "reward": 0.9071941375732422,
+       "reward_std": 0.8025857880711555,
+       "rewards/accuracy_reward": 0.24999999813735485,
+       "rewards/format_reward": 0.7265625,
+       "rewards/log_scaled_reward": -0.06936839601257816,
+       "step": 17
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 415.1797065734863,
+       "epoch": 0.26865671641791045,
+       "grad_norm": 0.4797329604625702,
+       "learning_rate": 9.972609476841365e-07,
+       "loss": 0.1162,
+       "num_tokens": 10210017.0,
+       "reward": 0.9835792705416679,
+       "reward_std": 0.7636988162994385,
+       "rewards/accuracy_reward": 0.24665178544819355,
+       "rewards/format_reward": 0.8147321343421936,
+       "rewards/log_scaled_reward": -0.07780469593126327,
+       "step": 18
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 441.59042739868164,
+       "epoch": 0.2835820895522388,
+       "grad_norm": 0.4748988151550293,
+       "learning_rate": 9.957224306869053e-07,
+       "loss": 0.0578,
+       "num_tokens": 10730474.0,
+       "reward": 1.0904420465230942,
+       "reward_std": 0.80109953135252,
+       "rewards/accuracy_reward": 0.300223208963871,
+       "rewards/format_reward": 0.8158482164144516,
+       "rewards/log_scaled_reward": -0.025629449490224943,
+       "step": 19
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 426.41743087768555,
+       "epoch": 0.29850746268656714,
+       "grad_norm": 0.47011756896972656,
+       "learning_rate": 9.938441702975689e-07,
+       "loss": 0.0503,
+       "num_tokens": 11239824.0,
+       "reward": 1.2500263825058937,
+       "reward_std": 0.8551982864737511,
+       "rewards/accuracy_reward": 0.3671875,
+       "rewards/format_reward": 0.8526785746216774,
+       "rewards/log_scaled_reward": 0.030160245776642114,
+       "step": 20
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 459.67078399658203,
+       "epoch": 0.31343283582089554,
+       "grad_norm": 0.37291309237480164,
+       "learning_rate": 9.916274537819773e-07,
+       "loss": 0.0366,
+       "num_tokens": 11776161.0,
+       "reward": 1.3390378654003143,
+       "reward_std": 0.8277674093842506,
+       "rewards/accuracy_reward": 0.4196428544819355,
+       "rewards/format_reward": 0.8526785746216774,
+       "rewards/log_scaled_reward": 0.06671636505052447,
+       "step": 21
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 416.93639755249023,
+       "epoch": 0.3283582089552239,
+       "grad_norm": 0.4895838499069214,
+       "learning_rate": 9.890738003669027e-07,
+       "loss": 0.0473,
+       "num_tokens": 12279200.0,
+       "reward": 1.5237962007522583,
+       "reward_std": 0.8290813863277435,
+       "rewards/accuracy_reward": 0.4810267835855484,
+       "rewards/format_reward": 0.8950892835855484,
+       "rewards/log_scaled_reward": 0.14768002880737185,
+       "step": 22
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 409.1294860839844,
+       "epoch": 0.34328358208955223,
+       "grad_norm": 0.4449058175086975,
+       "learning_rate": 9.861849601988383e-07,
+       "loss": 0.0255,
+       "num_tokens": 12776356.0,
+       "reward": 1.5605345666408539,
+       "reward_std": 0.8202421888709068,
+       "rewards/accuracy_reward": 0.488839291036129,
+       "rewards/format_reward": 0.9151785746216774,
+       "rewards/log_scaled_reward": 0.15651662228628993,
+       "step": 23
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 487.7277069091797,
+       "epoch": 0.3582089552238806,
+       "grad_norm": 0.33160600066185,
+       "learning_rate": 9.82962913144534e-07,
+       "loss": 0.0846,
+       "num_tokens": 13349480.0,
+       "reward": 1.634689912199974,
+       "reward_std": 0.7854569926857948,
+       "rewards/accuracy_reward": 0.5301339291036129,
+       "rewards/format_reward": 0.9151785746216774,
+       "rewards/log_scaled_reward": 0.18937731813639402,
+       "step": 24
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 451.1272506713867,
+       "epoch": 0.373134328358209,
+       "grad_norm": 0.3235984444618225,
+       "learning_rate": 9.794098674340966e-07,
+       "loss": 0.0424,
+       "num_tokens": 13868850.0,
+       "reward": 1.9291264861822128,
+       "reward_std": 0.6902804151177406,
+       "rewards/accuracy_reward": 0.6618303433060646,
+       "rewards/format_reward": 0.9441964328289032,
+       "rewards/log_scaled_reward": 0.32309958525002,
+       "step": 25
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 516.2678871154785,
+       "epoch": 0.3880597014925373,
+       "grad_norm": 2.3632936477661133,
+       "learning_rate": 9.755282581475767e-07,
+       "loss": 0.1035,
+       "num_tokens": 14469026.0,
+       "reward": 1.7230691313743591,
+       "reward_std": 0.6944096386432648,
+       "rewards/accuracy_reward": 0.5714285708963871,
+       "rewards/format_reward": 0.9285714253783226,
+       "rewards/log_scaled_reward": 0.22306904755532742,
+       "step": 26
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 504.8772506713867,
+       "epoch": 0.40298507462686567,
+       "grad_norm": 0.28064557909965515,
+       "learning_rate": 9.713207455460892e-07,
+       "loss": 0.0575,
+       "num_tokens": 15048084.0,
+       "reward": 1.8045607656240463,
+       "reward_std": 0.6639576852321625,
+       "rewards/accuracy_reward": 0.6127232238650322,
+       "rewards/format_reward": 0.9274553582072258,
+       "rewards/log_scaled_reward": 0.2643821220844984,
+       "step": 27
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 491.07703018188477,
+       "epoch": 0.417910447761194,
+       "grad_norm": 0.29867812991142273,
+       "learning_rate": 9.667902132486008e-07,
+       "loss": 0.0641,
+       "num_tokens": 15607481.0,
+       "reward": 1.9283190667629242,
+       "reward_std": 0.627920113503933,
+       "rewards/accuracy_reward": 0.6529017835855484,
+       "rewards/format_reward": 0.9575892761349678,
+       "rewards/log_scaled_reward": 0.3178279069252312,
+       "step": 28
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 559.9810523986816,
+       "epoch": 0.43283582089552236,
+       "grad_norm": 0.2834532558917999,
+       "learning_rate": 9.619397662556433e-07,
+       "loss": 0.0735,
+       "num_tokens": 16248464.0,
+       "reward": 1.7917230874300003,
+       "reward_std": 0.527396660298109,
+       "rewards/accuracy_reward": 0.6037946455180645,
+       "rewards/format_reward": 0.954241082072258,
+       "rewards/log_scaled_reward": 0.23368733504321426,
+       "step": 29
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 510.0111885070801,
+       "epoch": 0.44776119402985076,
+       "grad_norm": 0.25292134284973145,
+       "learning_rate": 9.567727288213004e-07,
+       "loss": 0.0778,
+       "num_tokens": 16828154.0,
+       "reward": 2.033374920487404,
+       "reward_std": 0.5300325341522694,
+       "rewards/accuracy_reward": 0.7008928507566452,
+       "rewards/format_reward": 0.967633917927742,
+       "rewards/log_scaled_reward": 0.36484804935753345,
+       "step": 30
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 505.8872985839844,
+       "epoch": 0.4626865671641791,
+       "grad_norm": 0.2629943788051605,
+       "learning_rate": 9.512926421749303e-07,
+       "loss": 0.0917,
+       "num_tokens": 17405221.0,
+       "reward": 1.9693890661001205,
+       "reward_std": 0.4761252626776695,
+       "rewards/accuracy_reward": 0.6595982164144516,
+       "rewards/format_reward": 0.9743303507566452,
+       "rewards/log_scaled_reward": 0.33546042814850807,
+       "step": 31
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 468.31809997558594,
+       "epoch": 0.47761194029850745,
+       "grad_norm": 0.25987353920936584,
+       "learning_rate": 9.455032620941839e-07,
+       "loss": 0.116,
+       "num_tokens": 17953570.0,
+       "reward": 2.0616614371538162,
+       "reward_std": 0.49665234982967377,
+       "rewards/accuracy_reward": 0.7075892761349678,
+       "rewards/format_reward": 0.962053582072258,
+       "rewards/log_scaled_reward": 0.39201846718788147,
+       "step": 32
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 504.4788246154785,
+       "epoch": 0.4925373134328358,
+       "grad_norm": 0.27474886178970337,
+       "learning_rate": 9.394085563309826e-07,
+       "loss": 0.1112,
+       "num_tokens": 18531095.0,
+       "reward": 1.9522841572761536,
+       "reward_std": 0.5375584103167057,
+       "rewards/accuracy_reward": 0.6607142835855484,
+       "rewards/format_reward": 0.965401791036129,
+       "rewards/log_scaled_reward": 0.32616803981363773,
+       "step": 33
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 498.3616371154785,
+       "epoch": 0.5074626865671642,
+       "grad_norm": 2.524815320968628,
+       "learning_rate": 9.330127018922193e-07,
+       "loss": 0.1055,
+       "num_tokens": 19108059.0,
+       "reward": 1.9061091989278793,
+       "reward_std": 0.5573387667536736,
+       "rewards/accuracy_reward": 0.6339285708963871,
+       "rewards/format_reward": 0.9642857164144516,
+       "rewards/log_scaled_reward": 0.30789486039429903,
+       "step": 34
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 444.20873260498047,
+       "epoch": 0.5223880597014925,
+       "grad_norm": 0.3097524046897888,
+       "learning_rate": 9.26320082177046e-07,
+       "loss": 0.1342,
+       "num_tokens": 19646062.0,
+       "reward": 1.9290964603424072,
+       "reward_std": 0.5330292023718357,
+       "rewards/accuracy_reward": 0.6272321492433548,
+       "rewards/format_reward": 0.9765625,
+       "rewards/log_scaled_reward": 0.325301731005311,
+       "step": 35
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 420.0848388671875,
+       "epoch": 0.5373134328358209,
+       "grad_norm": 0.3493054211139679,
+       "learning_rate": 9.19335283972712e-07,
+       "loss": 0.1097,
+       "num_tokens": 20165994.0,
+       "reward": 1.937290906906128,
+       "reward_std": 0.5374783836305141,
+       "rewards/accuracy_reward": 0.621651791036129,
+       "rewards/format_reward": 0.9732142835855484,
+       "rewards/log_scaled_reward": 0.34242471773177385,
+       "step": 36
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 367.64064025878906,
+       "epoch": 0.5522388059701493,
+       "grad_norm": 0.39517220854759216,
+       "learning_rate": 9.120630943110077e-07,
+       "loss": 0.1309,
+       "num_tokens": 20622824.0,
+       "reward": 1.9518826305866241,
+       "reward_std": 0.5057090371847153,
+       "rewards/accuracy_reward": 0.6294642947614193,
+       "rewards/format_reward": 0.9765625074505806,
+       "rewards/log_scaled_reward": 0.34585576388053596,
+       "step": 37
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 344.1038112640381,
+       "epoch": 0.5671641791044776,
+       "grad_norm": 1.3338432312011719,
+       "learning_rate": 9.045084971874737e-07,
+       "loss": 0.1333,
+       "num_tokens": 21076925.0,
+       "reward": 1.9407142996788025,
+       "reward_std": 0.5814780332148075,
+       "rewards/accuracy_reward": 0.6127232126891613,
+       "rewards/format_reward": 0.9709821417927742,
+       "rewards/log_scaled_reward": 0.35700881760567427,
+       "step": 38
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 277.19309425354004,
+       "epoch": 0.582089552238806,
+       "grad_norm": 0.4697703421115875,
+       "learning_rate": 8.966766701456176e-07,
+       "loss": 0.1089,
+       "num_tokens": 21451954.0,
+       "reward": 1.8939976394176483,
+       "reward_std": 0.531686820089817,
+       "rewards/accuracy_reward": 0.5703125037252903,
+       "rewards/format_reward": 0.984375,
+       "rewards/log_scaled_reward": 0.33931003510951996,
+       "step": 39
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 292.2109489440918,
+       "epoch": 0.5970149253731343,
+       "grad_norm": 0.42749401926994324,
+       "learning_rate": 8.885729807284854e-07,
+       "loss": 0.1051,
+       "num_tokens": 21837471.0,
+       "reward": 1.7810039222240448,
+       "reward_std": 0.5695139020681381,
+       "rewards/accuracy_reward": 0.5212053544819355,
+       "rewards/format_reward": 0.9832589253783226,
+       "rewards/log_scaled_reward": 0.2765395335154608,
+       "step": 40
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 227.70425033569336,
+       "epoch": 0.6119402985074627,
+       "grad_norm": 0.8301727771759033,
+       "learning_rate": 8.802029828000155e-07,
+       "loss": 0.1365,
+       "num_tokens": 22182102.0,
+       "reward": 1.7540639638900757,
+       "reward_std": 0.504084050655365,
+       "rewards/accuracy_reward": 0.4899553582072258,
+       "rewards/format_reward": 0.9754464253783226,
+       "rewards/log_scaled_reward": 0.2886621206998825,
+       "step": 41
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 186.12500953674316,
+       "epoch": 0.6268656716417911,
+       "grad_norm": 0.6249210238456726,
+       "learning_rate": 8.71572412738697e-07,
+       "loss": 0.1336,
+       "num_tokens": 22471350.0,
+       "reward": 1.9936908185482025,
+       "reward_std": 0.6450418382883072,
+       "rewards/accuracy_reward": 0.5926339402794838,
+       "rewards/format_reward": 0.9888392761349678,
+       "rewards/log_scaled_reward": 0.41221751645207405,
+       "step": 42
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 169.09710693359375,
+       "epoch": 0.6417910447761194,
+       "grad_norm": 0.7566676139831543,
+       "learning_rate": 8.626871855061437e-07,
+       "loss": 0.1662,
+       "num_tokens": 22758477.0,
+       "reward": 1.8480805903673172,
+       "reward_std": 0.5597276613116264,
+       "rewards/accuracy_reward": 0.511160708963871,
+       "rewards/format_reward": 0.9944196343421936,
+       "rewards/log_scaled_reward": 0.3425001185387373,
+       "step": 43
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 152.94978141784668,
+       "epoch": 0.6567164179104478,
+       "grad_norm": 0.9348928332328796,
+       "learning_rate": 8.535533905932737e-07,
+       "loss": 0.1287,
+       "num_tokens": 23022432.0,
+       "reward": 1.8545437455177307,
+       "reward_std": 0.5482046529650688,
+       "rewards/accuracy_reward": 0.5078125037252903,
+       "rewards/format_reward": 0.9921874925494194,
+       "rewards/log_scaled_reward": 0.35454366356134415,
+       "step": 44
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 128.6517915725708,
+       "epoch": 0.6716417910447762,
+       "grad_norm": 1.1114394664764404,
+       "learning_rate": 8.441772878468769e-07,
+       "loss": 0.1287,
+       "num_tokens": 23263912.0,
+       "reward": 1.7271955758333206,
+       "reward_std": 0.5110182501375675,
+       "rewards/accuracy_reward": 0.4330357164144516,
+       "rewards/format_reward": 0.9977678507566452,
+       "rewards/log_scaled_reward": 0.29639193043112755,
+       "step": 45
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 143.97545337677002,
+       "epoch": 0.6865671641791045,
+       "grad_norm": 1.1403673887252808,
+       "learning_rate": 8.34565303179429e-07,
+       "loss": 0.1742,
+       "num_tokens": 23526706.0,
+       "reward": 1.6235045939683914,
+       "reward_std": 0.5148132182657719,
+       "rewards/accuracy_reward": 0.386160708963871,
+       "rewards/format_reward": 0.987723208963871,
+       "rewards/log_scaled_reward": 0.24962060060352087,
+       "step": 46
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 104.42187976837158,
+       "epoch": 0.7014925373134329,
+       "grad_norm": 1.6832914352416992,
+       "learning_rate": 8.247240241650917e-07,
+       "loss": 0.1296,
+       "num_tokens": 23736324.0,
+       "reward": 1.690138816833496,
+       "reward_std": 0.5133109800517559,
+       "rewards/accuracy_reward": 0.4051339328289032,
+       "rewards/format_reward": 0.9933035671710968,
+       "rewards/log_scaled_reward": 0.29170125164091587,
+       "step": 47
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 92.65848731994629,
+       "epoch": 0.7164179104477612,
+       "grad_norm": 1.6501548290252686,
+       "learning_rate": 8.146601955249187e-07,
+       "loss": 0.1411,
+       "num_tokens": 23946162.0,
+       "reward": 1.6304273456335068,
+       "reward_std": 0.49451132118701935,
+       "rewards/accuracy_reward": 0.36941964365541935,
+       "rewards/format_reward": 0.995535708963871,
+       "rewards/log_scaled_reward": 0.26547193340957165,
+       "step": 48
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 78.72098636627197,
+       "epoch": 0.7313432835820896,
+       "grad_norm": 2.571906566619873,
+       "learning_rate": 8.043807145043603e-07,
+       "loss": 0.1793,
+       "num_tokens": 24153096.0,
+       "reward": 1.581173524260521,
+       "reward_std": 0.43424950167536736,
+       "rewards/accuracy_reward": 0.33748282864689827,
+       "rewards/format_reward": 0.9966517835855484,
+       "rewards/log_scaled_reward": 0.24970022030174732,
+       "step": 49
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 64.70759201049805,
+       "epoch": 0.746268656716418,
+       "grad_norm": 2.7052981853485107,
+       "learning_rate": 7.938926261462365e-07,
+       "loss": 0.1791,
+       "num_tokens": 24349226.0,
+       "reward": 1.5854334235191345,
+       "reward_std": 0.3977060168981552,
+       "rewards/accuracy_reward": 0.33593750186264515,
+       "rewards/format_reward": 0.9955357015132904,
+       "rewards/log_scaled_reward": 0.2539601270109415,
+       "step": 50
+     },
+     {
+       "clip_ratio": 0.0,
+       "completion_length": 46.40067148208618,
+       "epoch": 0.7611940298507462,
+       "grad_norm": 6.834611415863037,
+       "learning_rate": 7.832031184624164e-07,
+       "loss": 0.1055,
+       "num_tokens": 24528585.0,
+       "reward": 1.5368833392858505,
+       "reward_std": 0.47740813344717026,
+       "rewards/accuracy_reward": 0.30133928544819355,
+       "rewards/format_reward": 0.9988839253783226,
+       "rewards/log_scaled_reward": 0.2366600539535284,
774
+ "step": 51
775
+ },
776
+ {
777
+ "clip_ratio": 0.0,
778
+ "completion_length": 38.36384057998657,
779
+ "epoch": 0.7761194029850746,
780
+ "grad_norm": 4.969581127166748,
781
+ "learning_rate": 7.723195175075135e-07,
782
+ "loss": 0.1206,
783
+ "num_tokens": 24691607.0,
784
+ "reward": 1.415985830128193,
785
+ "reward_std": 0.29450324457138777,
786
+ "rewards/accuracy_reward": 0.2377232169965282,
787
+ "rewards/format_reward": 0.9977678507566452,
788
+ "rewards/log_scaled_reward": 0.18049467215314507,
789
+ "step": 52
790
+ },
791
+ {
792
+ "clip_ratio": 0.0,
793
+ "completion_length": 33.99218940734863,
794
+ "epoch": 0.7910447761194029,
795
+ "grad_norm": 4.195903778076172,
796
+ "learning_rate": 7.612492823579744e-07,
797
+ "loss": 0.145,
798
+ "num_tokens": 24849736.0,
799
+ "reward": 1.5035328567028046,
800
+ "reward_std": 0.34667503647506237,
801
+ "rewards/accuracy_reward": 0.27790178544819355,
802
+ "rewards/format_reward": 0.9988839253783226,
803
+ "rewards/log_scaled_reward": 0.2267470918595791,
804
+ "step": 53
805
+ },
806
+ {
807
+ "clip_ratio": 0.0,
808
+ "completion_length": 26.474331378936768,
809
+ "epoch": 0.8059701492537313,
810
+ "grad_norm": 24.98171615600586,
811
+ "learning_rate": 7.5e-07,
812
+ "loss": 0.0613,
813
+ "num_tokens": 25008289.0,
814
+ "reward": 1.5673803389072418,
815
+ "reward_std": 0.28740744665265083,
816
+ "rewards/accuracy_reward": 0.3069196417927742,
817
+ "rewards/format_reward": 1.0,
818
+ "rewards/log_scaled_reward": 0.26046060863882303,
819
+ "step": 54
820
+ },
821
+ {
822
+ "clip_ratio": 0.0,
823
+ "completion_length": 25.195313692092896,
824
+ "epoch": 0.8208955223880597,
825
+ "grad_norm": 5.9815216064453125,
826
+ "learning_rate": 7.385793801298042e-07,
827
+ "loss": 0.0376,
828
+ "num_tokens": 25169984.0,
829
+ "reward": 1.4668240398168564,
830
+ "reward_std": 0.3536365833133459,
831
+ "rewards/accuracy_reward": 0.25369161926209927,
832
+ "rewards/format_reward": 1.0,
833
+ "rewards/log_scaled_reward": 0.2157079027965665,
834
+ "step": 55
835
+ },
836
+ {
837
+ "clip_ratio": 0.0,
838
+ "completion_length": 23.91294765472412,
839
+ "epoch": 0.835820895522388,
840
+ "grad_norm": 7.0327372550964355,
841
+ "learning_rate": 7.269952498697734e-07,
842
+ "loss": 0.0386,
843
+ "num_tokens": 25322466.0,
844
+ "reward": 1.4718168079853058,
845
+ "reward_std": 0.3211175389587879,
846
+ "rewards/accuracy_reward": 0.2578124962747097,
847
+ "rewards/format_reward": 1.0,
848
+ "rewards/log_scaled_reward": 0.21400425024330616,
849
+ "step": 56
850
+ },
851
+ {
852
+ "clip_ratio": 0.0,
853
+ "completion_length": 23.36049222946167,
854
+ "epoch": 0.8507462686567164,
855
+ "grad_norm": 5.700937271118164,
856
+ "learning_rate": 7.152555484041475e-07,
857
+ "loss": 0.0265,
858
+ "num_tokens": 25466397.0,
859
+ "reward": 1.5502240508794785,
860
+ "reward_std": 0.2749287262558937,
861
+ "rewards/accuracy_reward": 0.296875,
862
+ "rewards/format_reward": 0.9988839253783226,
863
+ "rewards/log_scaled_reward": 0.25446509197354317,
864
+ "step": 57
865
+ },
866
+ {
867
+ "clip_ratio": 0.0,
868
+ "completion_length": 23.949777364730835,
869
+ "epoch": 0.8656716417910447,
870
+ "grad_norm": 5.249198913574219,
871
+ "learning_rate": 7.033683215379002e-07,
872
+ "loss": 0.023,
873
+ "num_tokens": 25613824.0,
874
+ "reward": 1.5308541655540466,
875
+ "reward_std": 0.26417338382452726,
876
+ "rewards/accuracy_reward": 0.28794642724096775,
877
+ "rewards/format_reward": 0.9988839253783226,
878
+ "rewards/log_scaled_reward": 0.2440237421542406,
879
+ "step": 58
880
+ },
881
+ {
882
+ "clip_ratio": 0.0,
883
+ "completion_length": 23.375000953674316,
884
+ "epoch": 0.8805970149253731,
885
+ "grad_norm": 8.684850692749023,
886
+ "learning_rate": 6.913417161825449e-07,
887
+ "loss": 0.0289,
888
+ "num_tokens": 25760176.0,
889
+ "reward": 1.566646233201027,
890
+ "reward_std": 0.22551130689680576,
891
+ "rewards/accuracy_reward": 0.30691963620483875,
892
+ "rewards/format_reward": 0.9966517761349678,
893
+ "rewards/log_scaled_reward": 0.26307476311922073,
894
+ "step": 59
895
+ },
896
+ {
897
+ "clip_ratio": 0.0,
898
+ "completion_length": 21.989956378936768,
899
+ "epoch": 0.8955223880597015,
900
+ "grad_norm": 34.08607864379883,
901
+ "learning_rate": 6.7918397477265e-07,
902
+ "loss": 0.0144,
903
+ "num_tokens": 25914911.0,
904
+ "reward": 1.628076210618019,
905
+ "reward_std": 0.3200679961591959,
906
+ "rewards/accuracy_reward": 0.33705356903374195,
907
+ "rewards/format_reward": 0.9966517761349678,
908
+ "rewards/log_scaled_reward": 0.2943707942031324,
909
+ "step": 60
910
+ },
911
+ {
912
+ "clip_ratio": 0.0,
913
+ "completion_length": 22.574777841567993,
914
+ "epoch": 0.9104477611940298,
915
+ "grad_norm": 5.27999210357666,
916
+ "learning_rate": 6.669034296168854e-07,
917
+ "loss": 0.0217,
918
+ "num_tokens": 26077314.0,
919
+ "reward": 1.6822472661733627,
920
+ "reward_std": 0.2845423389226198,
921
+ "rewards/accuracy_reward": 0.3649553582072258,
922
+ "rewards/format_reward": 0.995535708963871,
923
+ "rewards/log_scaled_reward": 0.3217560853809118,
924
+ "step": 61
925
+ },
926
+ {
927
+ "clip_ratio": 0.0,
928
+ "completion_length": 22.198662042617798,
929
+ "epoch": 0.9253731343283582,
930
+ "grad_norm": 4.5566277503967285,
931
+ "learning_rate": 6.545084971874736e-07,
932
+ "loss": 0.0064,
933
+ "num_tokens": 26222620.0,
934
+ "reward": 1.5729791224002838,
935
+ "reward_std": 0.22709419997408986,
936
+ "rewards/accuracy_reward": 0.31138393096625805,
937
+ "rewards/format_reward": 0.9921874925494194,
938
+ "rewards/log_scaled_reward": 0.2694076579064131,
939
+ "step": 62
940
+ },
941
+ {
942
+ "clip_ratio": 0.0,
943
+ "completion_length": 21.32924246788025,
944
+ "epoch": 0.9402985074626866,
945
+ "grad_norm": 3.140645980834961,
946
+ "learning_rate": 6.420076723519614e-07,
947
+ "loss": 0.0068,
948
+ "num_tokens": 26371243.0,
949
+ "reward": 1.6745910048484802,
950
+ "reward_std": 0.20502757839858532,
951
+ "rewards/accuracy_reward": 0.3582589291036129,
952
+ "rewards/format_reward": 1.0,
953
+ "rewards/log_scaled_reward": 0.3163319919258356,
954
+ "step": 63
955
+ },
956
+ {
957
+ "clip_ratio": 0.0,
958
+ "completion_length": 21.34709930419922,
959
+ "epoch": 0.9552238805970149,
960
+ "grad_norm": 4.071537494659424,
961
+ "learning_rate": 6.294095225512604e-07,
962
+ "loss": 0.0055,
963
+ "num_tokens": 26527050.0,
964
+ "reward": 1.5537814646959305,
965
+ "reward_std": 0.30234235525131226,
966
+ "rewards/accuracy_reward": 0.2979910681024194,
967
+ "rewards/format_reward": 1.0,
968
+ "rewards/log_scaled_reward": 0.25579037982970476,
969
+ "step": 64
970
+ },
971
+ {
972
+ "clip_ratio": 0.0,
973
+ "completion_length": 21.24776864051819,
974
+ "epoch": 0.9701492537313433,
975
+ "grad_norm": 4.297541618347168,
976
+ "learning_rate": 6.167226819279527e-07,
977
+ "loss": 0.0114,
978
+ "num_tokens": 26679280.0,
979
+ "reward": 1.5483618080615997,
980
+ "reward_std": 0.1530774086713791,
981
+ "rewards/accuracy_reward": 0.2968749953433871,
982
+ "rewards/format_reward": 0.9966517835855484,
983
+ "rewards/log_scaled_reward": 0.25483495742082596,
984
+ "step": 65
985
+ },
986
+ {
987
+ "clip_ratio": 0.0,
988
+ "completion_length": 21.180555820465088,
989
+ "epoch": 0.9850746268656716,
990
+ "grad_norm": 3.6126675605773926,
991
+ "learning_rate": 6.039558454088795e-07,
992
+ "loss": 0.009,
993
+ "num_tokens": 26828211.0,
994
+ "reward": 1.6490006893873215,
995
+ "reward_std": 0.29192496836185455,
996
+ "rewards/accuracy_reward": 0.3459821417927742,
997
+ "rewards/format_reward": 0.9988839253783226,
998
+ "rewards/log_scaled_reward": 0.304134588688612,
999
+ "step": 66
1000
+ },
1001
+ {
1002
+ "clip_ratio": 0.0,
1003
+ "completion_length": 20.77455449104309,
1004
+ "epoch": 1.0149253731343284,
1005
+ "grad_norm": 4.067701816558838,
1006
+ "learning_rate": 5.911177627460738e-07,
1007
+ "loss": 0.0094,
1008
+ "num_tokens": 26965209.0,
1009
+ "reward": 1.7030873149633408,
1010
+ "reward_std": 0.2512226551771164,
1011
+ "rewards/accuracy_reward": 0.37388391979038715,
1012
+ "rewards/format_reward": 0.9966517761349678,
1013
+ "rewards/log_scaled_reward": 0.3325514607131481,
1014
+ "step": 67
1015
+ },
1016
+ {
1017
+ "clip_ratio": 0.0,
1018
+ "completion_length": 20.35267925262451,
1019
+ "epoch": 1.0298507462686568,
1020
+ "grad_norm": 3.4311537742614746,
1021
+ "learning_rate": 5.782172325201155e-07,
1022
+ "loss": 0.0126,
1023
+ "num_tokens": 27116397.0,
1024
+ "reward": 1.5380910784006119,
1025
+ "reward_std": 0.20529233757406473,
1026
+ "rewards/accuracy_reward": 0.2901785708963871,
1027
+ "rewards/format_reward": 0.9988839253783226,
1028
+ "rewards/log_scaled_reward": 0.24902847222983837,
1029
+ "step": 68
1030
+ },
1031
+ {
1032
+ "clip_ratio": 0.0,
1033
+ "completion_length": 20.400670528411865,
1034
+ "epoch": 1.044776119402985,
1035
+ "grad_norm": 4.965053081512451,
1036
+ "learning_rate": 5.652630961100258e-07,
1037
+ "loss": 0.0182,
1038
+ "num_tokens": 27260012.0,
1039
+ "reward": 1.4891109764575958,
1040
+ "reward_std": 0.2253081511007622,
1041
+ "rewards/accuracy_reward": 0.26562499813735485,
1042
+ "rewards/format_reward": 0.9988839253783226,
1043
+ "rewards/log_scaled_reward": 0.22460200637578964,
1044
+ "step": 69
1045
+ },
1046
+ {
1047
+ "clip_ratio": 0.0,
1048
+ "completion_length": 19.54799175262451,
1049
+ "epoch": 1.0597014925373134,
1050
+ "grad_norm": 9.666987419128418,
1051
+ "learning_rate": 5.522642316338268e-07,
1052
+ "loss": 0.0249,
1053
+ "num_tokens": 27404823.0,
1054
+ "reward": 1.5430727303028107,
1055
+ "reward_std": 0.2664187829941511,
1056
+ "rewards/accuracy_reward": 0.2924107164144516,
1057
+ "rewards/format_reward": 0.9977678507566452,
1058
+ "rewards/log_scaled_reward": 0.25289412308484316,
1059
+ "step": 70
1060
+ },
1061
+ {
1062
+ "clip_ratio": 0.0,
1063
+ "completion_length": 18.02120590209961,
1064
+ "epoch": 1.0746268656716418,
1065
+ "grad_norm": 15.713820457458496,
1066
+ "learning_rate": 5.392295478639225e-07,
1067
+ "loss": 0.0167,
1068
+ "num_tokens": 27555962.0,
1069
+ "reward": 1.1339266449213028,
1070
+ "reward_std": 0.34662946686148643,
1071
+ "rewards/accuracy_reward": 0.1026785708963871,
1072
+ "rewards/format_reward": 0.9676339328289032,
1073
+ "rewards/log_scaled_reward": 0.06361408122756984,
1074
+ "step": 71
1075
+ },
1076
+ {
1077
+ "clip_ratio": 0.0,
1078
+ "completion_length": 19.39732265472412,
1079
+ "epoch": 1.0895522388059702,
1080
+ "grad_norm": 21.441844940185547,
1081
+ "learning_rate": 5.26167978121472e-07,
1082
+ "loss": 0.0653,
1083
+ "num_tokens": 27696134.0,
1084
+ "reward": 1.0387937128543854,
1085
+ "reward_std": 0.14305981155484915,
1086
+ "rewards/accuracy_reward": 0.04464285704307258,
1087
+ "rewards/format_reward": 0.9866071417927742,
1088
+ "rewards/log_scaled_reward": 0.0075436777296999935,
1089
+ "step": 72
1090
+ },
1091
+ {
1092
+ "clip_ratio": 0.0,
1093
+ "completion_length": 15.95089340209961,
1094
+ "epoch": 1.1044776119402986,
1095
+ "grad_norm": 12.856225967407227,
1096
+ "learning_rate": 5.130884741539366e-07,
1097
+ "loss": 0.018,
1098
+ "num_tokens": 27843106.0,
1099
+ "reward": 0.9690398126840591,
1100
+ "reward_std": 0.01615892370318761,
1101
+ "rewards/accuracy_reward": 0.0022321429569274187,
1102
+ "rewards/format_reward": 0.9988839253783226,
1103
+ "rewards/log_scaled_reward": -0.03207633784040809,
1104
+ "step": 73
1105
+ },
1106
+ {
1107
+ "clip_ratio": 0.0,
1108
+ "completion_length": 15.233259677886963,
1109
+ "epoch": 1.1194029850746268,
1110
+ "grad_norm": 1.9851570129394531,
1111
+ "learning_rate": 5e-07,
1112
+ "loss": 0.0015,
1113
+ "num_tokens": 27969987.0,
1114
+ "reward": 0.9649922177195549,
1115
+ "reward_std": 0.0031741062664423225,
1116
+ "rewards/accuracy_reward": 0.0,
1117
+ "rewards/format_reward": 0.9988839253783226,
1118
+ "rewards/log_scaled_reward": -0.03389178216457367,
1119
+ "step": 74
1120
+ },
1121
+ {
1122
+ "clip_ratio": 0.0,
1123
+ "completion_length": 15.01897418498993,
1124
+ "epoch": 1.1343283582089552,
1125
+ "grad_norm": 0.0,
1126
+ "learning_rate": 4.869115258460634e-07,
1127
+ "loss": 0.0,
1128
+ "num_tokens": 28107828.0,
1129
+ "reward": 0.9661169648170471,
1130
+ "reward_std": 0.0,
1131
+ "rewards/accuracy_reward": 0.0,
1132
+ "rewards/format_reward": 1.0,
1133
+ "rewards/log_scaled_reward": -0.03388310596346855,
1134
+ "step": 75
1135
+ },
1136
+ {
1137
+ "clip_ratio": 0.0,
1138
+ "completion_length": 14.996652722358704,
1139
+ "epoch": 1.1492537313432836,
1140
+ "grad_norm": 1.0442914962768555,
1141
+ "learning_rate": 4.7383202187852804e-07,
1142
+ "loss": -0.0006,
1143
+ "num_tokens": 28248081.0,
1144
+ "reward": 0.9638938158750534,
1145
+ "reward_std": 0.006288029253482819,
1146
+ "rewards/accuracy_reward": 0.0,
1147
+ "rewards/format_reward": 0.9977678507566452,
1148
+ "rewards/log_scaled_reward": -0.03387411683797836,
1149
+ "step": 76
1150
+ },
1151
+ {
1152
+ "clip_ratio": 0.0,
1153
+ "completion_length": 15.00334918498993,
1154
+ "epoch": 1.164179104477612,
1155
+ "grad_norm": 1.5838154554367065,
1156
+ "learning_rate": 4.6077045213607755e-07,
1157
+ "loss": 0.0005,
1158
+ "num_tokens": 28386956.0,
1159
+ "reward": 0.9627697318792343,
1160
+ "reward_std": 0.009467384777963161,
1161
+ "rewards/accuracy_reward": 0.0,
1162
+ "rewards/format_reward": 0.9966517835855484,
1163
+ "rewards/log_scaled_reward": -0.0338821173645556,
1164
+ "step": 77
1165
+ },
1166
+ {
1167
+ "clip_ratio": 0.0,
1168
+ "completion_length": 15.000000953674316,
1169
+ "epoch": 1.1791044776119404,
1170
+ "grad_norm": 1.337672472000122,
1171
+ "learning_rate": 4.477357683661733e-07,
1172
+ "loss": -0.0,
1173
+ "num_tokens": 28527204.0,
1174
+ "reward": 0.9638830795884132,
1175
+ "reward_std": 0.006318369880318642,
1176
+ "rewards/accuracy_reward": 0.0,
1177
+ "rewards/format_reward": 0.9977678582072258,
1178
+ "rewards/log_scaled_reward": -0.033884843811392784,
1179
+ "step": 78
1180
+ },
1181
+ {
1182
+ "clip_ratio": 0.0,
1183
+ "completion_length": 15.000000953674316,
1184
+ "epoch": 1.1940298507462686,
1185
+ "grad_norm": 0.0,
1186
+ "learning_rate": 4.347369038899743e-07,
1187
+ "loss": 0.0,
1188
+ "num_tokens": 28671788.0,
1189
+ "reward": 0.9661169648170471,
1190
+ "reward_std": 0.0,
1191
+ "rewards/accuracy_reward": 0.0,
1192
+ "rewards/format_reward": 1.0,
1193
+ "rewards/log_scaled_reward": -0.03388310596346855,
1194
+ "step": 79
1195
+ },
1196
+ {
1197
+ "clip_ratio": 0.0,
1198
+ "completion_length": 15.000000953674316,
1199
+ "epoch": 1.208955223880597,
1200
+ "grad_norm": 0.0,
1201
+ "learning_rate": 4.2178276747988444e-07,
1202
+ "loss": 0.0,
1203
+ "num_tokens": 28811644.0,
1204
+ "reward": 0.9661169648170471,
1205
+ "reward_std": 0.0,
1206
+ "rewards/accuracy_reward": 0.0,
1207
+ "rewards/format_reward": 1.0,
1208
+ "rewards/log_scaled_reward": -0.03388310596346855,
1209
+ "step": 80
1210
+ },
1211
+ {
1212
+ "clip_ratio": 0.0,
1213
+ "completion_length": 15.000000953674316,
1214
+ "epoch": 1.2238805970149254,
1215
+ "grad_norm": 0.0,
1216
+ "learning_rate": 4.0888223725392624e-07,
1217
+ "loss": 0.0,
1218
+ "num_tokens": 28949996.0,
1219
+ "reward": 0.9661169648170471,
1220
+ "reward_std": 0.0,
1221
+ "rewards/accuracy_reward": 0.0,
1222
+ "rewards/format_reward": 1.0,
1223
+ "rewards/log_scaled_reward": -0.03388310596346855,
1224
+ "step": 81
1225
+ },
1226
+ {
1227
+ "clip_ratio": 0.0,
1228
+ "completion_length": 15.000000953674316,
1229
+ "epoch": 1.2388059701492538,
1230
+ "grad_norm": 0.0,
1231
+ "learning_rate": 3.960441545911204e-07,
1232
+ "loss": 0.0,
1233
+ "num_tokens": 29100540.0,
1234
+ "reward": 0.9661169648170471,
1235
+ "reward_std": 0.0,
1236
+ "rewards/accuracy_reward": 0.0,
1237
+ "rewards/format_reward": 1.0,
1238
+ "rewards/log_scaled_reward": -0.03388310596346855,
1239
+ "step": 82
1240
+ },
1241
+ {
1242
+ "clip_ratio": 0.0,
1243
+ "completion_length": 15.000000953674316,
1244
+ "epoch": 1.2537313432835822,
1245
+ "grad_norm": 0.9363570213317871,
1246
+ "learning_rate": 3.8327731807204744e-07,
1247
+ "loss": -0.0,
1248
+ "num_tokens": 29242748.0,
1249
+ "reward": 0.9650017619132996,
1250
+ "reward_std": 0.0031542566139250994,
1251
+ "rewards/accuracy_reward": 0.0,
1252
+ "rewards/format_reward": 0.9988839253783226,
1253
+ "rewards/log_scaled_reward": -0.03388223238289356,
1254
+ "step": 83
1255
+ },
1256
+ {
1257
+ "clip_ratio": 0.0,
1258
+ "completion_length": 15.000000953674316,
1259
+ "epoch": 1.2686567164179103,
1260
+ "grad_norm": 0.0,
1261
+ "learning_rate": 3.7059047744873955e-07,
1262
+ "loss": 0.0,
1263
+ "num_tokens": 29394812.0,
1264
+ "reward": 0.9661169648170471,
1265
+ "reward_std": 0.0,
1266
+ "rewards/accuracy_reward": 0.0,
1267
+ "rewards/format_reward": 1.0,
1268
+ "rewards/log_scaled_reward": -0.03388310596346855,
1269
+ "step": 84
1270
+ },
1271
+ {
1272
+ "clip_ratio": 0.0,
1273
+ "completion_length": 15.000000953674316,
1274
+ "epoch": 1.2835820895522387,
1275
+ "grad_norm": 0.0,
1276
+ "learning_rate": 3.5799232764803867e-07,
1277
+ "loss": 0.0,
1278
+ "num_tokens": 29526756.0,
1279
+ "reward": 0.9661169648170471,
1280
+ "reward_std": 0.0,
1281
+ "rewards/accuracy_reward": 0.0,
1282
+ "rewards/format_reward": 1.0,
1283
+ "rewards/log_scaled_reward": -0.03388310596346855,
1284
+ "step": 85
1285
+ },
1286
+ {
1287
+ "clip_ratio": 0.0,
1288
+ "completion_length": 15.000000953674316,
1289
+ "epoch": 1.2985074626865671,
1290
+ "grad_norm": 0.0,
1291
+ "learning_rate": 3.454915028125263e-07,
1292
+ "loss": 0.0,
1293
+ "num_tokens": 29662892.0,
1294
+ "reward": 0.9661169648170471,
1295
+ "reward_std": 0.0,
1296
+ "rewards/accuracy_reward": 0.0,
1297
+ "rewards/format_reward": 1.0,
1298
+ "rewards/log_scaled_reward": -0.03388310596346855,
1299
+ "step": 86
1300
+ },
1301
+ {
1302
+ "clip_ratio": 0.0,
1303
+ "completion_length": 15.000000953674316,
1304
+ "epoch": 1.3134328358208955,
1305
+ "grad_norm": 0.0,
1306
+ "learning_rate": 3.330965703831146e-07,
1307
+ "loss": 0.0,
1308
+ "num_tokens": 29808492.0,
1309
+ "reward": 0.975348062813282,
1310
+ "reward_std": 0.0,
1311
+ "rewards/accuracy_reward": 0.0,
1312
+ "rewards/format_reward": 1.0,
1313
+ "rewards/log_scaled_reward": -0.024652006570249796,
1314
+ "step": 87
1315
+ },
1316
+ {
1317
+ "clip_ratio": 0.0,
1318
+ "completion_length": 15.000000953674316,
1319
+ "epoch": 1.328358208955224,
1320
+ "grad_norm": 0.0,
1321
+ "learning_rate": 3.2081602522734985e-07,
1322
+ "loss": 0.0,
1323
+ "num_tokens": 29960316.0,
1324
+ "reward": 0.9661169648170471,
1325
+ "reward_std": 0.0,
1326
+ "rewards/accuracy_reward": 0.0,
1327
+ "rewards/format_reward": 1.0,
1328
+ "rewards/log_scaled_reward": -0.03388310596346855,
1329
+ "step": 88
1330
+ },
1331
+ {
1332
+ "clip_ratio": 0.0,
1333
+ "completion_length": 15.000000953674316,
1334
+ "epoch": 1.3432835820895521,
1335
+ "grad_norm": 0.0,
1336
+ "learning_rate": 3.086582838174551e-07,
1337
+ "loss": 0.0,
1338
+ "num_tokens": 30099644.0,
1339
+ "reward": 0.9661169648170471,
1340
+ "reward_std": 0.0,
1341
+ "rewards/accuracy_reward": 0.0,
1342
+ "rewards/format_reward": 1.0,
1343
+ "rewards/log_scaled_reward": -0.03388310596346855,
1344
+ "step": 89
1345
+ },
1346
+ {
1347
+ "clip_ratio": 0.0,
1348
+ "completion_length": 15.005581259727478,
1349
+ "epoch": 1.3582089552238805,
1350
+ "grad_norm": 0.0,
1351
+ "learning_rate": 2.9663167846209996e-07,
1352
+ "loss": 0.0,
1353
+ "num_tokens": 30246217.0,
1354
+ "reward": 0.9661169648170471,
1355
+ "reward_std": 0.0,
1356
+ "rewards/accuracy_reward": 0.0,
1357
+ "rewards/format_reward": 1.0,
1358
+ "rewards/log_scaled_reward": -0.03388310596346855,
1359
+ "step": 90
1360
+ },
1361
+ {
1362
+ "clip_ratio": 0.0,
1363
+ "completion_length": 15.000000953674316,
1364
+ "epoch": 1.373134328358209,
1365
+ "grad_norm": 0.0,
1366
+ "learning_rate": 2.847444515958523e-07,
1367
+ "loss": 0.0,
1368
+ "num_tokens": 30391785.0,
1369
+ "reward": 0.9661169648170471,
1370
+ "reward_std": 0.0,
1371
+ "rewards/accuracy_reward": 0.0,
1372
+ "rewards/format_reward": 1.0,
1373
+ "rewards/log_scaled_reward": -0.03388310596346855,
1374
+ "step": 91
1375
+ },
1376
+ {
1377
+ "clip_ratio": 0.0,
1378
+ "completion_length": 15.000000953674316,
1379
+ "epoch": 1.3880597014925373,
1380
+ "grad_norm": 0.0,
1381
+ "learning_rate": 2.730047501302266e-07,
1382
+ "loss": 0.0,
1383
+ "num_tokens": 30531289.0,
1384
+ "reward": 0.9661169648170471,
1385
+ "reward_std": 0.0,
1386
+ "rewards/accuracy_reward": 0.0,
1387
+ "rewards/format_reward": 1.0,
1388
+ "rewards/log_scaled_reward": -0.03388310596346855,
1389
+ "step": 92
1390
+ },
1391
+ {
1392
+ "clip_ratio": 0.0,
1393
+ "completion_length": 15.000000953674316,
1394
+ "epoch": 1.4029850746268657,
1395
+ "grad_norm": 3.6358273029327393,
1396
+ "learning_rate": 2.6142061987019574e-07,
1397
+ "loss": -0.0,
1398
+ "num_tokens": 30664633.0,
1399
+ "reward": 0.966103158891201,
1400
+ "reward_std": 1.4753467439732049e-05,
1401
+ "rewards/accuracy_reward": 0.0,
1402
+ "rewards/format_reward": 1.0,
1403
+ "rewards/log_scaled_reward": -0.033896906301379204,
1404
+ "step": 93
1405
+ },
1406
+ {
1407
+ "clip_ratio": 0.0,
1408
+ "completion_length": 15.000000953674316,
1409
+ "epoch": 1.417910447761194,
1410
+ "grad_norm": 0.0,
1411
+ "learning_rate": 2.500000000000001e-07,
1412
+ "loss": 0.0,
1413
+ "num_tokens": 30820657.0,
1414
+ "reward": 0.9661169648170471,
1415
+ "reward_std": 0.0,
1416
+ "rewards/accuracy_reward": 0.0,
1417
+ "rewards/format_reward": 1.0,
1418
+ "rewards/log_scaled_reward": -0.03388310596346855,
1419
+ "step": 94
1420
+ },
1421
+ {
1422
+ "clip_ratio": 0.0,
1423
+ "completion_length": 15.000000953674316,
1424
+ "epoch": 1.4328358208955223,
1425
+ "grad_norm": 0.0,
1426
+ "learning_rate": 2.387507176420256e-07,
1427
+ "loss": 0.0,
1428
+ "num_tokens": 30965737.0,
1429
+ "reward": 0.9661169648170471,
1430
+ "reward_std": 0.0,
1431
+ "rewards/accuracy_reward": 0.0,
1432
+ "rewards/format_reward": 1.0,
1433
+ "rewards/log_scaled_reward": -0.03388310596346855,
1434
+ "step": 95
1435
+ },
1436
+ {
1437
+ "clip_ratio": 0.0,
1438
+ "completion_length": 15.00111699104309,
1439
+ "epoch": 1.4477611940298507,
1440
+ "grad_norm": 32.66771697998047,
1441
+ "learning_rate": 2.2768048249248644e-07,
1442
+ "loss": -0.0002,
1443
+ "num_tokens": 31108402.0,
1444
+ "reward": 0.9649775922298431,
1445
+ "reward_std": 0.0031557143665850163,
1446
+ "rewards/accuracy_reward": 0.0,
1447
+ "rewards/format_reward": 0.9988839253783226,
1448
+ "rewards/log_scaled_reward": -0.03390640066936612,
1449
+ "step": 96
1450
+ },
1451
+ {
1452
+ "clip_ratio": 0.0,
1453
+ "completion_length": 15.000000953674316,
1454
+ "epoch": 1.462686567164179,
1455
+ "grad_norm": 0.0,
1456
+ "learning_rate": 2.167968815375837e-07,
1457
+ "loss": 0.0,
1458
+ "num_tokens": 31257634.0,
1459
+ "reward": 0.9661169648170471,
1460
+ "reward_std": 0.0,
1461
+ "rewards/accuracy_reward": 0.0,
1462
+ "rewards/format_reward": 1.0,
1463
+ "rewards/log_scaled_reward": -0.03388310596346855,
1464
+ "step": 97
1465
+ },
1466
+ {
1467
+ "clip_ratio": 0.0,
1468
+ "completion_length": 15.000000953674316,
1469
+ "epoch": 1.4776119402985075,
1470
+ "grad_norm": 0.0,
1471
+ "learning_rate": 2.0610737385376348e-07,
1472
+ "loss": 0.0,
1473
+ "num_tokens": 31412970.0,
1474
+ "reward": 0.9661169648170471,
1475
+ "reward_std": 0.0,
1476
+ "rewards/accuracy_reward": 0.0,
1477
+ "rewards/format_reward": 1.0,
1478
+ "rewards/log_scaled_reward": -0.03388310596346855,
1479
+ "step": 98
1480
+ },
1481
+ {
1482
+ "clip_ratio": 0.0,
1483
+ "completion_length": 15.000000953674316,
1484
+ "epoch": 1.4925373134328357,
1485
+ "grad_norm": 0.0,
1486
+ "learning_rate": 1.9561928549563966e-07,
1487
+ "loss": 0.0,
1488
+ "num_tokens": 31564498.0,
1489
+ "reward": 0.9661169648170471,
1490
+ "reward_std": 0.0,
1491
+ "rewards/accuracy_reward": 0.0,
1492
+ "rewards/format_reward": 1.0,
1493
+ "rewards/log_scaled_reward": -0.03388310596346855,
1494
+ "step": 99
1495
+ },
1496
+ {
1497
+ "epoch": 1.5074626865671643,
1498
+ "grad_norm": 0.0,
1499
+ "learning_rate": 1.8533980447508135e-07,
1500
+ "loss": 0.0,
1501
+ "step": 100
1502
+ },
1503
+ {
1504
+ "epoch": 1.5074626865671643,
1505
+ "eval_clip_ratio": 0.0,
1506
+ "eval_completion_length": 15.000150587305676,
1507
+ "eval_loss": 0.0,
1508
+ "eval_num_tokens": 31697802.0,
1509
+ "eval_reward": 0.9661169648170471,
1510
+ "eval_reward_std": 0.0,
1511
+ "eval_rewards/accuracy_reward": 0.0,
1512
+ "eval_rewards/format_reward": 1.0,
1513
+ "eval_rewards/log_scaled_reward": -0.03388310596346855,
1514
+ "eval_runtime": 724.0402,
1515
+ "eval_samples_per_second": 6.906,
1516
+ "eval_steps_per_second": 0.062,
1517
+ "step": 100
1518
+ },
1519
+ {
1520
+ "clip_ratio": 0.0,
1521
+ "completion_length": 15.000000953674316,
1522
+ "epoch": 1.5223880597014925,
1523
+ "grad_norm": 0.0,
1524
+ "learning_rate": 1.7527597583490823e-07,
1525
+ "loss": 0.0,
1526
+ "num_tokens": 31837018.0,
1527
+ "reward": 0.9707325138151646,
1528
+ "reward_std": 0.0,
1529
+ "rewards/accuracy_reward": 0.0,
1530
+ "rewards/format_reward": 1.0,
1531
+ "rewards/log_scaled_reward": -0.029267556266859174,
1532
+ "step": 101
1533
+ },
1534
+ {
1535
+ "clip_ratio": 0.0,
1536
+ "completion_length": 15.000000953674316,
1537
+ "epoch": 1.537313432835821,
1538
+ "grad_norm": 0.0,
1539
+ "learning_rate": 1.6543469682057104e-07,
1540
+ "loss": 0.0,
1541
+ "num_tokens": 31986642.0,
1542
+ "reward": 0.9661169648170471,
1543
+ "reward_std": 0.0,
1544
+ "rewards/accuracy_reward": 0.0,
1545
+ "rewards/format_reward": 1.0,
1546
+ "rewards/log_scaled_reward": -0.03388310596346855,
1547
+ "step": 102
1548
+ },
1549
+ {
1550
+ "clip_ratio": 0.0,
1551
+ "completion_length": 15.000000953674316,
1552
+ "epoch": 1.5522388059701493,
1553
+ "grad_norm": 0.0,
1554
+ "learning_rate": 1.5582271215312293e-07,
1555
+ "loss": 0.0,
1556
+ "num_tokens": 32126642.0,
1557
+ "reward": 0.9661169648170471,
1558
+ "reward_std": 0.0,
1559
+ "rewards/accuracy_reward": 0.0,
1560
+ "rewards/format_reward": 1.0,
1561
+ "rewards/log_scaled_reward": -0.03388310596346855,
1562
+ "step": 103
1563
+ },
1564
+ {
1565
+ "clip_ratio": 0.0,
1566
+ "completion_length": 15.000000953674316,
1567
+ "epoch": 1.5671641791044775,
1568
+ "grad_norm": 0.0,
1569
+ "learning_rate": 1.4644660940672627e-07,
1570
+ "loss": 0.0,
1571
+ "num_tokens": 32276154.0,
1572
+ "reward": 0.9661169648170471,
1573
+ "reward_std": 0.0,
1574
+ "rewards/accuracy_reward": 0.0,
1575
+ "rewards/format_reward": 1.0,
1576
+ "rewards/log_scaled_reward": -0.03388310596346855,
1577
+ "step": 104
1578
+ },
1579
+ {
1580
+ "clip_ratio": 0.0,
1581
+ "completion_length": 15.000000953674316,
1582
+ "epoch": 1.582089552238806,
1583
+ "grad_norm": 0.0,
1584
+ "learning_rate": 1.3731281449385628e-07,
1585
+ "loss": 0.0,
1586
+ "num_tokens": 32426898.0,
1587
+ "reward": 0.9661169648170471,
1588
+ "reward_std": 0.0,
1589
+ "rewards/accuracy_reward": 0.0,
1590
+ "rewards/format_reward": 1.0,
1591
+ "rewards/log_scaled_reward": -0.03388310596346855,
1592
+ "step": 105
1593
+ },
1594
+ {
1595
+ "clip_ratio": 0.0,
1596
+ "completion_length": 15.000000953674316,
1597
+ "epoch": 1.5970149253731343,
1598
+ "grad_norm": 0.0,
1599
+ "learning_rate": 1.284275872613028e-07,
1600
+ "loss": 0.0,
1601
+ "num_tokens": 32571226.0,
1602
+ "reward": 0.9661169648170471,
1603
+ "reward_std": 0.0,
1604
+ "rewards/accuracy_reward": 0.0,
1605
+ "rewards/format_reward": 1.0,
1606
+ "rewards/log_scaled_reward": -0.03388310596346855,
1607
+ "step": 106
1608
+ },
1609
+ {
1610
+ "clip_ratio": 0.0,
1611
+ "completion_length": 15.000000953674316,
1612
+ "epoch": 1.6119402985074627,
1613
+ "grad_norm": 0.0,
1614
+ "learning_rate": 1.1979701719998454e-07,
1615
+ "loss": 0.0,
1616
+ "num_tokens": 32719426.0,
1617
+ "reward": 0.9661169648170471,
1618
+ "reward_std": 0.0,
1619
+ "rewards/accuracy_reward": 0.0,
1620
+ "rewards/format_reward": 1.0,
1621
+ "rewards/log_scaled_reward": -0.03388310596346855,
1622
+ "step": 107
1623
+ },
1624
+ {
1625
+ "clip_ratio": 0.0,
1626
+ "completion_length": 15.000000953674316,
1627
+ "epoch": 1.626865671641791,
1628
+ "grad_norm": 0.0,
+ "learning_rate": 1.1142701927151454e-07,
+ "loss": 0.0,
+ "num_tokens": 32852794.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 108
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.6417910447761193,
+ "grad_norm": 0.0,
+ "learning_rate": 1.0332332985438247e-07,
+ "loss": 0.0,
+ "num_tokens": 33000498.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 109
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.6567164179104479,
+ "grad_norm": 0.0,
+ "learning_rate": 9.549150281252632e-08,
+ "loss": 0.0,
+ "num_tokens": 33139850.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 110
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.671641791044776,
+ "grad_norm": 0.0,
+ "learning_rate": 8.793690568899215e-08,
+ "loss": 0.0,
+ "num_tokens": 33276866.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 111
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.6865671641791045,
+ "grad_norm": 0.0,
+ "learning_rate": 8.066471602728803e-08,
+ "loss": 0.0,
+ "num_tokens": 33407970.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 112
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.7014925373134329,
+ "grad_norm": 0.0,
+ "learning_rate": 7.36799178229539e-08,
+ "loss": 0.0,
+ "num_tokens": 33549218.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 113
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.716417910447761,
+ "grad_norm": 0.0,
+ "learning_rate": 6.698729810778064e-08,
+ "loss": 0.0,
+ "num_tokens": 33694498.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 114
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.7313432835820897,
+ "grad_norm": 0.0,
+ "learning_rate": 6.059144366901736e-08,
+ "loss": 0.0,
+ "num_tokens": 33836482.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 115
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.7462686567164178,
+ "grad_norm": 0.0,
+ "learning_rate": 5.44967379058161e-08,
+ "loss": 0.0,
+ "num_tokens": 33970506.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 116
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.002233028411865,
+ "epoch": 1.7611940298507462,
+ "grad_norm": 0.0,
+ "learning_rate": 4.870735782506979e-08,
+ "loss": 0.0,
+ "num_tokens": 34139396.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 117
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.7761194029850746,
+ "grad_norm": 0.0,
+ "learning_rate": 4.322727117869951e-08,
+ "loss": 0.0,
+ "num_tokens": 34279244.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 118
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.7910447761194028,
+ "grad_norm": 0.0,
+ "learning_rate": 3.806023374435663e-08,
+ "loss": 0.0,
+ "num_tokens": 34418092.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 119
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8059701492537314,
+ "grad_norm": 0.0,
+ "learning_rate": 3.3209786751399184e-08,
+ "loss": 0.0,
+ "num_tokens": 34570476.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 120
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8208955223880596,
+ "grad_norm": 0.0,
+ "learning_rate": 2.8679254453910785e-08,
+ "loss": 0.0,
+ "num_tokens": 34714220.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 121
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.835820895522388,
+ "grad_norm": 0.0,
+ "learning_rate": 2.4471741852423233e-08,
+ "loss": 0.0,
+ "num_tokens": 34874452.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 122
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8507462686567164,
+ "grad_norm": 0.0,
+ "learning_rate": 2.0590132565903473e-08,
+ "loss": 0.0,
+ "num_tokens": 35020932.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 123
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8656716417910446,
+ "grad_norm": 0.0,
+ "learning_rate": 1.7037086855465898e-08,
+ "loss": 0.0,
+ "num_tokens": 35158836.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 124
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8805970149253732,
+ "grad_norm": 0.0,
+ "learning_rate": 1.3815039801161722e-08,
+ "loss": 0.0,
+ "num_tokens": 35299044.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 125
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.8955223880597014,
+ "grad_norm": 0.0,
+ "learning_rate": 1.0926199633097154e-08,
+ "loss": 0.0,
+ "num_tokens": 35438660.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 126
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.9104477611940298,
+ "grad_norm": 0.0,
+ "learning_rate": 8.372546218022746e-09,
+ "loss": 0.0,
+ "num_tokens": 35589036.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 127
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.9253731343283582,
+ "grad_norm": 0.0,
+ "learning_rate": 6.15582970243117e-09,
+ "loss": 0.0,
+ "num_tokens": 35732844.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 128
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.9402985074626866,
+ "grad_norm": 0.0,
+ "learning_rate": 4.277569313094809e-09,
+ "loss": 0.0,
+ "num_tokens": 35869612.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 129
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.955223880597015,
+ "grad_norm": 0.0,
+ "learning_rate": 2.739052315863355e-09,
+ "loss": 0.0,
+ "num_tokens": 36024796.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 130
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.000000953674316,
+ "epoch": 1.9701492537313432,
+ "grad_norm": 0.0,
+ "learning_rate": 1.541333133436018e-09,
+ "loss": 0.0,
+ "num_tokens": 36165404.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 131
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 15.0,
+ "epoch": 1.9850746268656716,
+ "grad_norm": 0.0,
+ "learning_rate": 6.852326227130833e-10,
+ "loss": 0.0,
+ "num_tokens": 36316348.0,
+ "reward": 0.9661169648170471,
+ "reward_std": 0.0,
+ "rewards/accuracy_reward": 0.0,
+ "rewards/format_reward": 1.0,
+ "rewards/log_scaled_reward": -0.03388310596346855,
+ "step": 132
+ },
+ {
+ "epoch": 1.9850746268656716,
+ "step": 132,
+ "total_flos": 0.0,
+ "train_loss": 0.04579740530018937,
+ "train_runtime": 16666.94,
+ "train_samples_per_second": 0.9,
+ "train_steps_per_second": 0.008
+ }
+ ],
+ "logging_steps": 1,
+ "max_steps": 134,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 2,
+ "save_steps": 500,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 16,
+ "trial_name": null,
+ "trial_params": null
+ }