jbenbudd committed on
Commit 7ba0b05 · 1 Parent(s): ed92ab1

Initial commit of the LoRA/adapter model

README.md CHANGED
@@ -1,25 +1,25 @@
1
  ---
2
  library_name: peft
3
  license: other
4
- base_model: GreatCaptainNemo/ProLLaMA_Stage_1
5
  tags:
6
  - llama-factory
7
  - lora
8
  - generated_from_trainer
9
  model-index:
10
- - name: train_2025-03-11-22-40-04
11
  results: []
12
  ---
13
 
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
17
- # train_2025-03-11-22-40-04
18
 
19
- This model is a fine-tuned version of [GreatCaptainNemo/ProLLaMA_Stage_1](https://huggingface.co/GreatCaptainNemo/ProLLaMA_Stage_1) on the adpr_train dataset.
20
  It achieves the following results on the evaluation set:
21
- - Loss: 0.0947
22
- - Num Input Tokens Seen: 8867536
23
 
24
  ## Model description
25
 
@@ -39,11 +39,11 @@ More information needed
39
 
40
  The following hyperparameters were used during training:
41
  - learning_rate: 5e-05
42
- - train_batch_size: 8
43
- - eval_batch_size: 8
44
  - seed: 42
45
  - gradient_accumulation_steps: 8
46
- - total_train_batch_size: 64
47
  - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
48
  - lr_scheduler_type: cosine
49
  - lr_scheduler_warmup_steps: 20
@@ -53,26 +53,18 @@ The following hyperparameters were used during training:
53
 
54
  | Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
55
  |:-------------:|:------:|:----:|:---------------:|:-----------------:|
56
- | 0.1778 | 0.2114 | 100 | 0.1754 | 624768 |
57
- | 0.1668 | 0.4228 | 200 | 0.1641 | 1249984 |
58
- | 0.1569 | 0.6342 | 300 | 0.1600 | 1875648 |
59
- | 0.1313 | 0.8457 | 400 | 0.1339 | 2500800 |
60
- | 0.1134 | 1.0571 | 500 | 0.1193 | 3124224 |
61
- | 0.1059 | 1.2685 | 600 | 0.1088 | 3750336 |
62
- | 0.096 | 1.4799 | 700 | 0.1083 | 4375808 |
63
- | 0.0998 | 1.6913 | 800 | 0.1001 | 5000128 |
64
- | 0.1083 | 1.9027 | 900 | 0.0991 | 5624576 |
65
- | 0.0953 | 2.1142 | 1000 | 0.0972 | 6248320 |
66
- | 0.0887 | 2.3256 | 1100 | 0.0964 | 6873152 |
67
- | 0.0889 | 2.5370 | 1200 | 0.0954 | 7498688 |
68
- | 0.0859 | 2.7484 | 1300 | 0.0950 | 8124864 |
69
- | 0.0883 | 2.9598 | 1400 | 0.0947 | 8749760 |
70
 
71
 
72
  ### Framework versions
73
 
74
- - PEFT 0.12.0
75
- - Transformers 4.48.3
76
  - Pytorch 2.3.1+cu121
77
- - Datasets 3.3.2
78
  - Tokenizers 0.21.0
 
1
  ---
2
  library_name: peft
3
  license: other
4
+ base_model: GreatCaptainNemo/ProLLaMA
5
  tags:
6
  - llama-factory
7
  - lora
8
  - generated_from_trainer
9
  model-index:
10
+ - name: train_2025-04-05-23-57-03
11
  results: []
12
  ---
13
 
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
17
+ # train_2025-04-05-23-57-03
18
 
19
+ This model is a fine-tuned version of [GreatCaptainNemo/ProLLaMA](https://huggingface.co/GreatCaptainNemo/ProLLaMA) on the adpr_train dataset.
20
  It achieves the following results on the evaluation set:
21
+ - Loss: 0.2991
22
+ - Num Input Tokens Seen: 8057088
23
 
24
  ## Model description
25
 
 
39
 
40
  The following hyperparameters were used during training:
41
  - learning_rate: 5e-05
42
+ - train_batch_size: 16
43
+ - eval_batch_size: 16
44
  - seed: 42
45
  - gradient_accumulation_steps: 8
46
+ - total_train_batch_size: 128
47
  - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
48
  - lr_scheduler_type: cosine
49
  - lr_scheduler_warmup_steps: 20
 
53
 
54
  | Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
55
  |:-------------:|:------:|:----:|:---------------:|:-----------------:|
56
+ | 0.46 | 0.4561 | 100 | 0.4706 | 1229824 |
57
+ | 0.4222 | 0.9122 | 200 | 0.4173 | 2457344 |
58
+ | 0.382 | 1.3649 | 300 | 0.3807 | 3679728 |
59
+ | 0.3574 | 1.8210 | 400 | 0.3323 | 4908144 |
60
+ | 0.311 | 2.2737 | 500 | 0.3114 | 6131072 |
61
+ | 0.2808 | 2.7298 | 600 | 0.3001 | 7358336 |
 
62
 
63
 
64
  ### Framework versions
65
 
66
+ - PEFT 0.14.0
67
+ - Transformers 4.50.0
68
  - Pytorch 2.3.1+cu121
69
+ - Datasets 3.4.1
70
  - Tokenizers 0.21.0
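
For convenience, a minimal, hedged sketch of loading this adapter on top of the base model with PEFT. The adapter path below is a placeholder (substitute this repository's id or a local checkout), and the dtype, device, and prompt are assumptions rather than part of the recorded training setup:

```python
# Sketch only: apply the LoRA adapter from this repo to GreatCaptainNemo/ProLLaMA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_PATH = "path/to/this/adapter"  # placeholder: this repo's id or a local directory

# Tokenizer files (tokenizer_config.json, special_tokens_map.json) ship with the adapter.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)

base = AutoModelForCausalLM.from_pretrained(
    "GreatCaptainNemo/ProLLaMA",
    torch_dtype=torch.bfloat16,  # bf16 was the training compute type; assumption for inference
    device_map="auto",
)
# Training ran with train.resize_vocab: true and the adapter stores embed_tokens/lm_head,
# so make sure the base embeddings match the tokenizer before attaching the adapter.
base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()

inputs = tokenizer("YOUR PROMPT HERE", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```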
adapter_config.json CHANGED
@@ -1,8 +1,10 @@
1
  {
2
  "alpha_pattern": {},
3
  "auto_mapping": null,
4
- "base_model_name_or_path": "GreatCaptainNemo/ProLLaMA_Stage_1",
5
  "bias": "none",
 
 
6
  "fan_in_fan_out": false,
7
  "inference_mode": true,
8
  "init_lora_weights": true,
@@ -11,22 +13,26 @@
11
  "layers_to_transform": null,
12
  "loftq_config": {},
13
  "lora_alpha": 128,
 
14
  "lora_dropout": 0.01,
15
  "megatron_config": null,
16
  "megatron_core": "megatron.core",
17
- "modules_to_save": null,
 
 
 
18
  "peft_type": "LORA",
19
  "r": 64,
20
  "rank_pattern": {},
21
  "revision": null,
22
  "target_modules": [
23
- "q_proj",
24
- "up_proj",
25
- "gate_proj",
26
  "v_proj",
 
 
27
  "o_proj",
28
- "down_proj",
29
- "k_proj"
 
30
  ],
31
  "task_type": "CAUSAL_LM",
32
  "use_dora": false,
 
1
  {
2
  "alpha_pattern": {},
3
  "auto_mapping": null,
4
+ "base_model_name_or_path": "GreatCaptainNemo/ProLLaMA",
5
  "bias": "none",
6
+ "eva_config": null,
7
+ "exclude_modules": null,
8
  "fan_in_fan_out": false,
9
  "inference_mode": true,
10
  "init_lora_weights": true,
 
13
  "layers_to_transform": null,
14
  "loftq_config": {},
15
  "lora_alpha": 128,
16
+ "lora_bias": false,
17
  "lora_dropout": 0.01,
18
  "megatron_config": null,
19
  "megatron_core": "megatron.core",
20
+ "modules_to_save": [
21
+ "lm_head",
22
+ "embed_tokens"
23
+ ],
24
  "peft_type": "LORA",
25
  "r": 64,
26
  "rank_pattern": {},
27
  "revision": null,
28
  "target_modules": [
 
 
 
29
  "v_proj",
30
+ "up_proj",
31
+ "k_proj",
32
  "o_proj",
33
+ "q_proj",
34
+ "gate_proj",
35
+ "down_proj"
36
  ],
37
  "task_type": "CAUSAL_LM",
38
  "use_dora": false,
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8ec8217f8814ec8fec1a7afe03a712ccc65a6bc6150d75f40338d02609d6edcd
3
- size 639691872
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1fd7dafc924d48955fd6dc8c3614d1473e067f0e918024bf03c1ce3a70677197
3
+ size 1688269144
all_results.json CHANGED
@@ -1,13 +1,13 @@
1
  {
2
- "epoch": 3.0,
3
- "eval_loss": 0.09474755078554153,
4
- "eval_runtime": 40.4682,
5
- "eval_samples_per_second": 83.102,
6
- "eval_steps_per_second": 10.403,
7
- "num_input_tokens_seen": 8867536,
8
- "total_flos": 3.600530754427945e+17,
9
- "train_loss": 0.2131162985812786,
10
- "train_runtime": 4701.6415,
11
- "train_samples_per_second": 19.312,
12
- "train_steps_per_second": 0.302
13
  }
 
1
  {
2
+ "epoch": 2.9897377423033067,
3
+ "eval_loss": 0.29908037185668945,
4
+ "eval_runtime": 34.2949,
5
+ "eval_samples_per_second": 90.917,
6
+ "eval_steps_per_second": 5.686,
7
+ "num_input_tokens_seen": 8057088,
8
+ "total_flos": 3.334823948247368e+17,
9
+ "train_loss": 0.48195079037043603,
10
+ "train_runtime": 3553.5571,
11
+ "train_samples_per_second": 23.687,
12
+ "train_steps_per_second": 0.185
13
  }
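
As a quick sanity check, the runtime and token counts above can be combined into an approximate training throughput. A small sketch, assuming the file is the all_results.json shown in this commit:

```python
# Sketch: derive approximate training throughput from all_results.json.
import json

with open("all_results.json") as f:
    results = json.load(f)

tokens_per_second = results["num_input_tokens_seen"] / results["train_runtime"]
print(f"~{tokens_per_second:,.0f} training tokens/s")
# With the values committed here (8,057,088 tokens over ~3,553.6 s) this is roughly
# 2,270 tokens/s, up from roughly 1,890 tokens/s in the previous run.
```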
eval_results.json CHANGED
@@ -1,8 +1,8 @@
1
  {
2
- "epoch": 3.0,
3
- "eval_loss": 0.09474755078554153,
4
- "eval_runtime": 40.4682,
5
- "eval_samples_per_second": 83.102,
6
- "eval_steps_per_second": 10.403,
7
- "num_input_tokens_seen": 8867536
8
  }
 
1
  {
2
+ "epoch": 2.9897377423033067,
3
+ "eval_loss": 0.29908037185668945,
4
+ "eval_runtime": 34.2949,
5
+ "eval_samples_per_second": 90.917,
6
+ "eval_steps_per_second": 5.686,
7
+ "num_input_tokens_seen": 8057088
8
  }
llamaboard_config.yaml CHANGED
@@ -15,7 +15,7 @@ train.badam_mode: layer
15
  train.badam_switch_interval: 50
16
  train.badam_switch_mode: ascending
17
  train.badam_update_ratio: 0.05
18
- train.batch_size: 8
19
  train.compute_type: bf16
20
  train.create_new_adapter: false
21
  train.cutoff_len: 2048
@@ -55,7 +55,7 @@ train.pref_ftx: 0
55
  train.pref_loss: sigmoid
56
  train.report_to:
57
  - none
58
- train.resize_vocab: false
59
  train.reward_model: []
60
  train.save_steps: 100
61
  train.swanlab_api_key: ''
 
15
  train.badam_switch_interval: 50
16
  train.badam_switch_mode: ascending
17
  train.badam_update_ratio: 0.05
18
+ train.batch_size: 16
19
  train.compute_type: bf16
20
  train.create_new_adapter: false
21
  train.cutoff_len: 2048
 
55
  train.pref_loss: sigmoid
56
  train.report_to:
57
  - none
58
+ train.resize_vocab: true
59
  train.reward_model: []
60
  train.save_steps: 100
61
  train.swanlab_api_key: ''
model_eval_results.csv CHANGED
The diff for this file is too large to render. See raw diff
 
running_log.txt CHANGED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json CHANGED
@@ -13,7 +13,13 @@
13
  "rstrip": false,
14
  "single_word": false
15
  },
16
- "pad_token": "</s>",
 
17
  "unk_token": {
18
  "content": "<unk>",
19
  "lstrip": false,
 
13
  "rstrip": false,
14
  "single_word": false
15
  },
16
+ "pad_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
  "unk_token": {
24
  "content": "<unk>",
25
  "lstrip": false,
tokenizer_config.json CHANGED
@@ -34,7 +34,7 @@
34
  "eos_token": "</s>",
35
  "extra_special_tokens": {},
36
  "legacy": true,
37
- "model_max_length": 2048,
38
  "pad_token": "</s>",
39
  "padding_side": "right",
40
  "sp_model_kwargs": {},
 
34
  "eos_token": "</s>",
35
  "extra_special_tokens": {},
36
  "legacy": true,
37
+ "model_max_length": 1000000000000000019884624838656,
38
  "pad_token": "</s>",
39
  "padding_side": "right",
40
  "sp_model_kwargs": {},
trainer_log.jsonl CHANGED
The diff for this file is too large to render. See raw diff
 
trainer_state.json CHANGED
@@ -1,2417 +1,1130 @@
1
  {
 
2
  "best_metric": null,
3
  "best_model_checkpoint": null,
4
- "epoch": 3.0,
5
  "eval_steps": 100,
6
- "global_step": 1419,
7
  "is_hyper_param_search": false,
8
  "is_local_process_zero": true,
9
  "is_world_process_zero": true,
10
  "log_history": [
11
  {
12
- "epoch": 0.010570824524312896,
13
- "grad_norm": 45.65249252319336,
14
  "learning_rate": 1.25e-05,
15
- "loss": 14.2333,
16
- "num_input_tokens_seen": 31104,
17
  "step": 5
18
  },
19
  {
20
- "epoch": 0.021141649048625793,
21
- "grad_norm": 33.49619674682617,
22
  "learning_rate": 2.5e-05,
23
- "loss": 9.2972,
24
- "num_input_tokens_seen": 62208,
25
  "step": 10
26
  },
27
  {
28
- "epoch": 0.03171247357293869,
29
- "grad_norm": 9.210739135742188,
30
  "learning_rate": 3.7500000000000003e-05,
31
- "loss": 2.411,
32
- "num_input_tokens_seen": 93504,
33
  "step": 15
34
  },
35
  {
36
- "epoch": 0.042283298097251586,
37
- "grad_norm": 7.316084384918213,
38
  "learning_rate": 5e-05,
39
- "loss": 0.9413,
40
- "num_input_tokens_seen": 124800,
41
  "step": 20
42
  },
43
  {
44
- "epoch": 0.052854122621564484,
45
- "grad_norm": 7.541203498840332,
46
- "learning_rate": 4.9998424168507275e-05,
47
- "loss": 0.4389,
48
- "num_input_tokens_seen": 156096,
49
  "step": 25
50
  },
51
  {
52
- "epoch": 0.06342494714587738,
53
- "grad_norm": 7.138961315155029,
54
- "learning_rate": 4.999369687268868e-05,
55
- "loss": 0.4112,
56
- "num_input_tokens_seen": 187200,
57
  "step": 30
58
  },
59
  {
60
- "epoch": 0.07399577167019028,
61
- "grad_norm": 0.46440309286117554,
62
- "learning_rate": 4.998581870849795e-05,
63
- "loss": 0.3011,
64
- "num_input_tokens_seen": 218496,
65
  "step": 35
66
  },
67
  {
68
- "epoch": 0.08456659619450317,
69
- "grad_norm": 1.6051653623580933,
70
- "learning_rate": 4.997479066910782e-05,
71
- "loss": 0.2631,
72
- "num_input_tokens_seen": 249920,
73
  "step": 40
74
  },
75
  {
76
- "epoch": 0.09513742071881606,
77
- "grad_norm": 1.2404223680496216,
78
- "learning_rate": 4.996061414478485e-05,
79
- "loss": 0.2223,
80
- "num_input_tokens_seen": 281216,
81
  "step": 45
82
  },
83
  {
84
- "epoch": 0.10570824524312897,
85
- "grad_norm": 0.2932145297527313,
86
- "learning_rate": 4.994329092271408e-05,
87
- "loss": 0.2446,
88
- "num_input_tokens_seen": 312512,
89
  "step": 50
90
  },
91
  {
92
- "epoch": 0.11627906976744186,
93
- "grad_norm": 7.334479331970215,
94
- "learning_rate": 4.992282318677387e-05,
95
- "loss": 0.2994,
96
- "num_input_tokens_seen": 343680,
97
  "step": 55
98
  },
99
  {
100
- "epoch": 0.12684989429175475,
101
- "grad_norm": 1.8428316116333008,
102
- "learning_rate": 4.9899213517260416e-05,
103
- "loss": 0.2916,
104
- "num_input_tokens_seen": 374848,
105
  "step": 60
106
  },
107
  {
108
- "epoch": 0.13742071881606766,
109
- "grad_norm": 0.9089680314064026,
110
- "learning_rate": 4.9872464890562576e-05,
111
- "loss": 0.2317,
112
- "num_input_tokens_seen": 406400,
113
  "step": 65
114
  },
115
  {
116
- "epoch": 0.14799154334038056,
117
- "grad_norm": 4.8257880210876465,
118
- "learning_rate": 4.9842580678786645e-05,
119
- "loss": 0.2216,
120
- "num_input_tokens_seen": 437696,
121
  "step": 70
122
  },
123
  {
124
- "epoch": 0.15856236786469344,
125
- "grad_norm": 0.614710807800293,
126
- "learning_rate": 4.980956464933116e-05,
127
- "loss": 0.2311,
128
- "num_input_tokens_seen": 468864,
129
  "step": 75
130
  },
131
  {
132
- "epoch": 0.16913319238900634,
133
- "grad_norm": 1.1520471572875977,
134
- "learning_rate": 4.9773420964412064e-05,
135
- "loss": 0.2051,
136
- "num_input_tokens_seen": 499968,
137
  "step": 80
138
  },
139
  {
140
- "epoch": 0.17970401691331925,
141
- "grad_norm": 0.8753998279571533,
142
- "learning_rate": 4.973415418053789e-05,
143
- "loss": 0.1928,
144
- "num_input_tokens_seen": 531072,
145
  "step": 85
146
  },
147
  {
148
- "epoch": 0.19027484143763213,
149
- "grad_norm": 0.2460280954837799,
150
- "learning_rate": 4.969176924793543e-05,
151
- "loss": 0.1849,
152
- "num_input_tokens_seen": 562240,
153
  "step": 90
154
  },
155
  {
156
- "epoch": 0.20084566596194503,
157
- "grad_norm": 0.22848260402679443,
158
- "learning_rate": 4.96462715099256e-05,
159
- "loss": 0.172,
160
- "num_input_tokens_seen": 593536,
161
  "step": 95
162
  },
163
  {
164
- "epoch": 0.21141649048625794,
165
- "grad_norm": 0.4746881127357483,
166
- "learning_rate": 4.9597666702249865e-05,
167
- "loss": 0.1778,
168
- "num_input_tokens_seen": 624768,
169
  "step": 100
170
  },
171
  {
172
- "epoch": 0.21141649048625794,
173
- "eval_loss": 0.17541779577732086,
174
- "eval_runtime": 40.3512,
175
- "eval_samples_per_second": 83.343,
176
- "eval_steps_per_second": 10.433,
177
- "num_input_tokens_seen": 624768,
178
  "step": 100
179
  },
180
  {
181
- "epoch": 0.2219873150105708,
182
- "grad_norm": 0.2084885537624359,
183
- "learning_rate": 4.954596095234718e-05,
184
- "loss": 0.1754,
185
- "num_input_tokens_seen": 656256,
186
  "step": 105
187
  },
188
  {
189
- "epoch": 0.23255813953488372,
190
- "grad_norm": 0.10341060161590576,
191
- "learning_rate": 4.9491160778581445e-05,
192
- "loss": 0.1727,
193
- "num_input_tokens_seen": 687808,
194
  "step": 110
195
  },
196
  {
197
- "epoch": 0.24312896405919662,
198
- "grad_norm": 11.542490005493164,
199
- "learning_rate": 4.943327308941985e-05,
200
- "loss": 0.1728,
201
- "num_input_tokens_seen": 718848,
202
  "step": 115
203
  },
204
  {
205
- "epoch": 0.2536997885835095,
206
- "grad_norm": 0.07902055978775024,
207
- "learning_rate": 4.9372305182561874e-05,
208
- "loss": 0.1649,
209
- "num_input_tokens_seen": 750080,
210
  "step": 120
211
  },
212
  {
213
- "epoch": 0.2642706131078224,
214
- "grad_norm": 0.09493754059076309,
215
- "learning_rate": 4.9308264744019326e-05,
216
- "loss": 0.1647,
217
- "num_input_tokens_seen": 781184,
218
  "step": 125
219
  },
220
  {
221
- "epoch": 0.2748414376321353,
222
- "grad_norm": 1.9789398908615112,
223
- "learning_rate": 4.9241159847147405e-05,
224
- "loss": 0.1683,
225
- "num_input_tokens_seen": 812160,
226
  "step": 130
227
  },
228
  {
229
- "epoch": 0.2854122621564482,
230
- "grad_norm": 0.1611855924129486,
231
- "learning_rate": 4.917099895162689e-05,
232
- "loss": 0.1597,
233
- "num_input_tokens_seen": 843584,
234
  "step": 135
235
  },
236
  {
237
- "epoch": 0.2959830866807611,
238
- "grad_norm": 0.3848009705543518,
239
- "learning_rate": 4.9097790902397686e-05,
240
- "loss": 0.1669,
241
- "num_input_tokens_seen": 875200,
242
  "step": 140
243
  },
244
  {
245
- "epoch": 0.30655391120507397,
246
- "grad_norm": 0.13839414715766907,
247
- "learning_rate": 4.902154492854374e-05,
248
- "loss": 0.1568,
249
- "num_input_tokens_seen": 906432,
250
  "step": 145
251
  },
252
  {
253
- "epoch": 0.3171247357293869,
254
- "grad_norm": 0.12030225247144699,
255
- "learning_rate": 4.8942270642129604e-05,
256
- "loss": 0.1608,
257
- "num_input_tokens_seen": 937664,
258
  "step": 150
259
  },
260
  {
261
- "epoch": 0.3276955602536998,
262
- "grad_norm": 0.21667665243148804,
263
- "learning_rate": 4.8859978036988644e-05,
264
- "loss": 0.1654,
265
- "num_input_tokens_seen": 968960,
266
  "step": 155
267
  },
268
  {
269
- "epoch": 0.3382663847780127,
270
- "grad_norm": 0.146434485912323,
271
- "learning_rate": 4.8774677487463175e-05,
272
- "loss": 0.1639,
273
- "num_input_tokens_seen": 1000192,
274
  "step": 160
275
  },
276
  {
277
- "epoch": 0.3488372093023256,
278
- "grad_norm": 0.11135861277580261,
279
- "learning_rate": 4.8686379747096556e-05,
280
- "loss": 0.16,
281
- "num_input_tokens_seen": 1031616,
282
  "step": 165
283
  },
284
  {
285
- "epoch": 0.3594080338266385,
286
- "grad_norm": 0.07981089502573013,
287
- "learning_rate": 4.85950959472776e-05,
288
- "loss": 0.1645,
289
- "num_input_tokens_seen": 1062656,
290
  "step": 170
291
  },
292
  {
293
- "epoch": 0.3699788583509514,
294
- "grad_norm": 0.057556912302970886,
295
- "learning_rate": 4.850083759583723e-05,
296
- "loss": 0.1604,
297
- "num_input_tokens_seen": 1093888,
298
  "step": 175
299
  },
300
  {
301
- "epoch": 0.38054968287526425,
302
- "grad_norm": 0.12838751077651978,
303
- "learning_rate": 4.840361657559775e-05,
304
- "loss": 0.1707,
305
- "num_input_tokens_seen": 1125184,
306
  "step": 180
307
  },
308
  {
309
- "epoch": 0.39112050739957716,
310
- "grad_norm": 0.20540139079093933,
311
- "learning_rate": 4.830344514287478e-05,
312
- "loss": 0.1544,
313
- "num_input_tokens_seen": 1156224,
314
  "step": 185
315
  },
316
  {
317
- "epoch": 0.40169133192389006,
318
- "grad_norm": 0.11389072984457016,
319
- "learning_rate": 4.8200335925932185e-05,
320
- "loss": 0.1615,
321
- "num_input_tokens_seen": 1187392,
322
  "step": 190
323
  },
324
  {
325
- "epoch": 0.41226215644820297,
326
- "grad_norm": 0.34211698174476624,
327
- "learning_rate": 4.809430192339008e-05,
328
- "loss": 0.159,
329
- "num_input_tokens_seen": 1218624,
330
  "step": 195
331
  },
332
  {
333
- "epoch": 0.42283298097251587,
334
- "grad_norm": 0.4587748348712921,
335
- "learning_rate": 4.79853565025861e-05,
336
- "loss": 0.1668,
337
- "num_input_tokens_seen": 1249984,
338
  "step": 200
339
  },
340
  {
341
- "epoch": 0.42283298097251587,
342
- "eval_loss": 0.16408737003803253,
343
- "eval_runtime": 40.303,
344
- "eval_samples_per_second": 83.443,
345
- "eval_steps_per_second": 10.446,
346
- "num_input_tokens_seen": 1249984,
347
  "step": 200
348
  },
349
  {
350
- "epoch": 0.4334038054968288,
351
- "grad_norm": 0.4510205388069153,
352
- "learning_rate": 4.787351339789025e-05,
353
- "loss": 0.1606,
354
- "num_input_tokens_seen": 1281216,
355
  "step": 205
356
  },
357
  {
358
- "epoch": 0.4439746300211416,
359
- "grad_norm": 0.07297118008136749,
360
- "learning_rate": 4.7758786708973444e-05,
361
- "loss": 0.1628,
362
- "num_input_tokens_seen": 1312768,
363
  "step": 210
364
  },
365
  {
366
- "epoch": 0.45454545454545453,
367
- "grad_norm": 0.1418861746788025,
368
- "learning_rate": 4.764119089903008e-05,
369
- "loss": 0.1617,
370
- "num_input_tokens_seen": 1344192,
371
  "step": 215
372
  },
373
  {
374
- "epoch": 0.46511627906976744,
375
- "grad_norm": 0.15124177932739258,
376
- "learning_rate": 4.752074079295457e-05,
377
- "loss": 0.162,
378
- "num_input_tokens_seen": 1375424,
379
  "step": 220
380
  },
381
  {
382
- "epoch": 0.47568710359408034,
383
- "grad_norm": 0.10217985510826111,
384
- "learning_rate": 4.739745157547258e-05,
385
- "loss": 0.1683,
386
- "num_input_tokens_seen": 1406656,
387
  "step": 225
388
  },
389
  {
390
- "epoch": 0.48625792811839325,
391
- "grad_norm": 0.24457764625549316,
392
- "learning_rate": 4.727133878922663e-05,
393
- "loss": 0.155,
394
- "num_input_tokens_seen": 1437824,
395
  "step": 230
396
  },
397
  {
398
- "epoch": 0.49682875264270615,
399
- "grad_norm": 1.5385491847991943,
400
- "learning_rate": 4.7142418332816735e-05,
401
- "loss": 0.1585,
402
- "num_input_tokens_seen": 1468992,
403
  "step": 235
404
  },
405
  {
406
- "epoch": 0.507399577167019,
407
- "grad_norm": 25.565441131591797,
408
- "learning_rate": 4.701070645879612e-05,
409
- "loss": 0.1882,
410
- "num_input_tokens_seen": 1500224,
411
  "step": 240
412
  },
413
  {
414
- "epoch": 0.5179704016913319,
415
- "grad_norm": 0.18062542378902435,
416
- "learning_rate": 4.687621977162231e-05,
417
- "loss": 0.1742,
418
- "num_input_tokens_seen": 1531584,
419
  "step": 245
420
  },
421
  {
422
- "epoch": 0.5285412262156448,
423
- "grad_norm": 0.20139415562152863,
424
- "learning_rate": 4.673897522556385e-05,
425
- "loss": 0.1607,
426
- "num_input_tokens_seen": 1562880,
427
  "step": 250
428
  },
429
  {
430
- "epoch": 0.5391120507399577,
431
- "grad_norm": 0.20215147733688354,
432
- "learning_rate": 4.6598990122562996e-05,
433
- "loss": 0.156,
434
- "num_input_tokens_seen": 1594176,
435
  "step": 255
436
  },
437
  {
438
- "epoch": 0.5496828752642706,
439
- "grad_norm": 0.19909769296646118,
440
- "learning_rate": 4.645628211005443e-05,
441
- "loss": 0.1584,
442
- "num_input_tokens_seen": 1625344,
443
  "step": 260
444
  },
445
  {
446
- "epoch": 0.5602536997885835,
447
- "grad_norm": 0.0857083797454834,
448
- "learning_rate": 4.63108691787406e-05,
449
- "loss": 0.1514,
450
- "num_input_tokens_seen": 1656448,
451
  "step": 265
452
  },
453
  {
454
- "epoch": 0.5708245243128964,
455
- "grad_norm": 0.11940807104110718,
456
- "learning_rate": 4.616276966032363e-05,
457
- "loss": 0.1649,
458
- "num_input_tokens_seen": 1687744,
459
  "step": 270
460
  },
461
  {
462
- "epoch": 0.5813953488372093,
463
- "grad_norm": 0.07466170191764832,
464
- "learning_rate": 4.6012002225194325e-05,
465
- "loss": 0.1577,
466
- "num_input_tokens_seen": 1719040,
467
  "step": 275
468
  },
469
  {
470
- "epoch": 0.5919661733615222,
471
- "grad_norm": 0.1683170348405838,
472
- "learning_rate": 4.585858588007849e-05,
473
- "loss": 0.1562,
474
- "num_input_tokens_seen": 1750208,
475
  "step": 280
476
  },
477
  {
478
- "epoch": 0.6025369978858351,
479
- "grad_norm": 0.3020932674407959,
480
- "learning_rate": 4.570253996564075e-05,
481
- "loss": 0.1438,
482
- "num_input_tokens_seen": 1781824,
483
  "step": 285
484
  },
485
  {
486
- "epoch": 0.6131078224101479,
487
- "grad_norm": 0.18477758765220642,
488
- "learning_rate": 4.554388415404644e-05,
489
- "loss": 0.165,
490
- "num_input_tokens_seen": 1813248,
491
  "step": 290
492
  },
493
  {
494
- "epoch": 0.6236786469344608,
495
- "grad_norm": 17.139799118041992,
496
- "learning_rate": 4.538263844648149e-05,
497
- "loss": 0.1618,
498
- "num_input_tokens_seen": 1844736,
499
  "step": 295
500
  },
501
  {
502
- "epoch": 0.6342494714587738,
503
- "grad_norm": 0.15323784947395325,
504
- "learning_rate": 4.521882317063103e-05,
505
- "loss": 0.1569,
506
- "num_input_tokens_seen": 1875648,
507
  "step": 300
508
  },
509
  {
510
- "epoch": 0.6342494714587738,
511
- "eval_loss": 0.16001471877098083,
512
- "eval_runtime": 40.331,
513
- "eval_samples_per_second": 83.385,
514
- "eval_steps_per_second": 10.439,
515
- "num_input_tokens_seen": 1875648,
516
  "step": 300
517
  },
518
  {
519
- "epoch": 0.6448202959830867,
520
- "grad_norm": 0.16273276507854462,
521
- "learning_rate": 4.505245897811672e-05,
522
- "loss": 0.1598,
523
- "num_input_tokens_seen": 1907008,
524
  "step": 305
525
  },
526
  {
527
- "epoch": 0.6553911205073996,
528
- "grad_norm": 0.1982152760028839,
529
- "learning_rate": 4.488356684189325e-05,
530
- "loss": 0.1501,
531
- "num_input_tokens_seen": 1938496,
532
  "step": 310
533
  },
534
  {
535
- "epoch": 0.6659619450317125,
536
- "grad_norm": 0.15199612081050873,
537
- "learning_rate": 4.4712168053604407e-05,
538
- "loss": 0.1456,
539
- "num_input_tokens_seen": 1969664,
540
  "step": 315
541
  },
542
  {
543
- "epoch": 0.6765327695560254,
544
- "grad_norm": 0.21335271000862122,
545
- "learning_rate": 4.4538284220898864e-05,
546
- "loss": 0.1502,
547
- "num_input_tokens_seen": 2001024,
548
  "step": 320
549
  },
550
  {
551
- "epoch": 0.6871035940803383,
552
- "grad_norm": 0.1967424601316452,
553
- "learning_rate": 4.4361937264706186e-05,
554
- "loss": 0.1446,
555
- "num_input_tokens_seen": 2032448,
556
  "step": 325
557
  },
558
  {
559
- "epoch": 0.6976744186046512,
560
- "grad_norm": 0.13540367782115936,
561
- "learning_rate": 4.418314941647335e-05,
562
- "loss": 0.1478,
563
- "num_input_tokens_seen": 2063872,
564
  "step": 330
565
  },
566
  {
567
- "epoch": 0.7082452431289641,
568
- "grad_norm": 0.17547021806240082,
569
- "learning_rate": 4.400194321536209e-05,
570
- "loss": 0.147,
571
- "num_input_tokens_seen": 2095104,
572
  "step": 335
573
  },
574
  {
575
- "epoch": 0.718816067653277,
576
- "grad_norm": 0.29705560207366943,
577
- "learning_rate": 4.381834150540749e-05,
578
- "loss": 0.1479,
579
- "num_input_tokens_seen": 2126336,
580
  "step": 340
581
  },
582
  {
583
- "epoch": 0.7293868921775899,
584
- "grad_norm": 0.24377129971981049,
585
- "learning_rate": 4.363236743263808e-05,
586
- "loss": 0.1448,
587
- "num_input_tokens_seen": 2157376,
588
  "step": 345
589
  },
590
  {
591
- "epoch": 0.7399577167019028,
592
- "grad_norm": 0.16772465407848358,
593
- "learning_rate": 4.3444044442157914e-05,
594
- "loss": 0.1443,
595
- "num_input_tokens_seen": 2188864,
596
  "step": 350
597
  },
598
  {
599
- "epoch": 0.7505285412262156,
600
- "grad_norm": 0.18267805874347687,
601
- "learning_rate": 4.3253396275190926e-05,
602
- "loss": 0.1464,
603
- "num_input_tokens_seen": 2220288,
604
  "step": 355
605
  },
606
  {
607
- "epoch": 0.7610993657505285,
608
- "grad_norm": 0.18752624094486237,
609
- "learning_rate": 4.306044696608797e-05,
610
- "loss": 0.1345,
611
- "num_input_tokens_seen": 2251520,
612
  "step": 360
613
  },
614
  {
615
- "epoch": 0.7716701902748414,
616
- "grad_norm": 0.21755804121494293,
617
- "learning_rate": 4.286522083929686e-05,
618
- "loss": 0.1311,
619
- "num_input_tokens_seen": 2282624,
620
  "step": 365
621
  },
622
  {
623
- "epoch": 0.7822410147991543,
624
- "grad_norm": 0.2151494175195694,
625
- "learning_rate": 4.266774250629589e-05,
626
- "loss": 0.1428,
627
- "num_input_tokens_seen": 2313792,
628
  "step": 370
629
  },
630
  {
631
- "epoch": 0.7928118393234672,
632
- "grad_norm": 0.24206243455410004,
633
- "learning_rate": 4.2468036862491176e-05,
634
- "loss": 0.1361,
635
- "num_input_tokens_seen": 2344896,
636
  "step": 375
637
  },
638
  {
639
- "epoch": 0.8033826638477801,
640
- "grad_norm": 0.26434633135795593,
641
- "learning_rate": 4.226612908407814e-05,
642
- "loss": 0.1436,
643
- "num_input_tokens_seen": 2376192,
644
  "step": 380
645
  },
646
  {
647
- "epoch": 0.813953488372093,
648
- "grad_norm": 0.26230087876319885,
649
- "learning_rate": 4.2062044624867656e-05,
650
- "loss": 0.138,
651
- "num_input_tokens_seen": 2407232,
652
  "step": 385
653
  },
654
  {
655
- "epoch": 0.8245243128964059,
656
- "grad_norm": 0.27545973658561707,
657
- "learning_rate": 4.1855809213077146e-05,
658
- "loss": 0.129,
659
- "num_input_tokens_seen": 2438528,
660
  "step": 390
661
  },
662
  {
663
- "epoch": 0.8350951374207188,
664
- "grad_norm": 0.2836856245994568,
665
- "learning_rate": 4.1647448848087166e-05,
666
- "loss": 0.1278,
667
- "num_input_tokens_seen": 2469504,
668
  "step": 395
669
  },
670
  {
671
- "epoch": 0.8456659619450317,
672
- "grad_norm": 0.3141574561595917,
673
- "learning_rate": 4.143698979716372e-05,
674
- "loss": 0.1313,
675
- "num_input_tokens_seen": 2500800,
676
  "step": 400
677
  },
678
  {
679
- "epoch": 0.8456659619450317,
680
- "eval_loss": 0.1339479386806488,
681
- "eval_runtime": 40.355,
682
- "eval_samples_per_second": 83.335,
683
- "eval_steps_per_second": 10.432,
684
- "num_input_tokens_seen": 2500800,
685
  "step": 400
686
  },
687
  {
688
- "epoch": 0.8562367864693446,
689
- "grad_norm": 0.21188335120677948,
690
- "learning_rate": 4.122445859214682e-05,
691
- "loss": 0.1308,
692
- "num_input_tokens_seen": 2531904,
693
  "step": 405
694
  },
695
  {
696
- "epoch": 0.8668076109936576,
697
- "grad_norm": 0.22360175848007202,
698
- "learning_rate": 4.100988202610577e-05,
699
- "loss": 0.1213,
700
- "num_input_tokens_seen": 2563392,
701
  "step": 410
702
  },
703
  {
704
- "epoch": 0.8773784355179705,
705
- "grad_norm": 0.1944059282541275,
706
- "learning_rate": 4.079328714996139e-05,
707
- "loss": 0.1232,
708
- "num_input_tokens_seen": 2594688,
709
  "step": 415
710
  },
711
  {
712
- "epoch": 0.8879492600422833,
713
- "grad_norm": 0.3056269884109497,
714
- "learning_rate": 4.0574701269075844e-05,
715
- "loss": 0.1328,
716
- "num_input_tokens_seen": 2626112,
717
  "step": 420
718
  },
719
  {
720
- "epoch": 0.8985200845665962,
721
- "grad_norm": 0.25777870416641235,
722
- "learning_rate": 4.035415193981032e-05,
723
- "loss": 0.1237,
724
- "num_input_tokens_seen": 2657344,
725
  "step": 425
726
  },
727
  {
728
- "epoch": 0.9090909090909091,
729
- "grad_norm": 0.3172893822193146,
730
- "learning_rate": 4.0131666966051127e-05,
731
- "loss": 0.131,
732
- "num_input_tokens_seen": 2688256,
733
  "step": 430
734
  },
735
  {
736
- "epoch": 0.919661733615222,
737
- "grad_norm": 0.3003503978252411,
738
- "learning_rate": 3.990727439570453e-05,
739
- "loss": 0.1301,
740
- "num_input_tokens_seen": 2719232,
741
  "step": 435
742
  },
743
  {
744
- "epoch": 0.9302325581395349,
745
- "grad_norm": 0.350626677274704,
746
- "learning_rate": 3.9681002517160845e-05,
747
- "loss": 0.1249,
748
- "num_input_tokens_seen": 2750464,
749
  "step": 440
750
  },
751
  {
752
- "epoch": 0.9408033826638478,
753
- "grad_norm": 1.0592330694198608,
754
- "learning_rate": 3.945287985572826e-05,
755
- "loss": 0.1176,
756
- "num_input_tokens_seen": 2781440,
757
  "step": 445
758
  },
759
  {
760
- "epoch": 0.9513742071881607,
761
- "grad_norm": 0.6262398362159729,
762
- "learning_rate": 3.922293517003668e-05,
763
- "loss": 0.119,
764
- "num_input_tokens_seen": 2812864,
765
  "step": 450
766
  },
767
  {
768
- "epoch": 0.9619450317124736,
769
- "grad_norm": 1.1160500049591064,
770
- "learning_rate": 3.899119744841232e-05,
771
- "loss": 0.1166,
772
- "num_input_tokens_seen": 2844096,
773
  "step": 455
774
  },
775
  {
776
- "epoch": 0.9725158562367865,
777
- "grad_norm": 0.24976776540279388,
778
- "learning_rate": 3.875769590522314e-05,
779
- "loss": 0.1207,
780
- "num_input_tokens_seen": 2875392,
781
  "step": 460
782
  },
783
  {
784
- "epoch": 0.9830866807610994,
785
- "grad_norm": 0.17139197885990143,
786
- "learning_rate": 3.8522459977195955e-05,
787
- "loss": 0.125,
788
- "num_input_tokens_seen": 2906432,
789
  "step": 465
790
  },
791
  {
792
- "epoch": 0.9936575052854123,
793
- "grad_norm": 0.22843952476978302,
794
- "learning_rate": 3.828551931970549e-05,
795
- "loss": 0.1278,
796
- "num_input_tokens_seen": 2937728,
797
  "step": 470
798
  },
799
  {
800
- "epoch": 1.0042283298097252,
801
- "grad_norm": 0.1976863592863083,
802
- "learning_rate": 3.8046903803035716e-05,
803
- "loss": 0.1226,
804
- "num_input_tokens_seen": 2968192,
805
  "step": 475
806
  },
807
  {
808
- "epoch": 1.014799154334038,
809
- "grad_norm": 0.280398428440094,
810
- "learning_rate": 3.780664350861431e-05,
811
- "loss": 0.1169,
812
- "num_input_tokens_seen": 2999488,
813
  "step": 480
814
  },
815
  {
816
- "epoch": 1.025369978858351,
817
- "grad_norm": 0.2658718526363373,
818
- "learning_rate": 3.756476872522035e-05,
819
- "loss": 0.116,
820
- "num_input_tokens_seen": 3030720,
821
  "step": 485
822
  },
823
  {
824
- "epoch": 1.0359408033826638,
825
- "grad_norm": 0.27286848425865173,
826
- "learning_rate": 3.7321309945165905e-05,
827
- "loss": 0.1197,
828
- "num_input_tokens_seen": 3062016,
829
  "step": 490
830
  },
831
  {
832
- "epoch": 1.0465116279069768,
833
- "grad_norm": 0.5994888544082642,
834
- "learning_rate": 3.707629786045198e-05,
835
- "loss": 0.1184,
836
- "num_input_tokens_seen": 3093184,
837
  "step": 495
838
  },
839
  {
840
- "epoch": 1.0570824524312896,
841
- "grad_norm": 0.21185331046581268,
842
- "learning_rate": 3.682976335889935e-05,
843
- "loss": 0.1134,
844
- "num_input_tokens_seen": 3124224,
845
  "step": 500
846
  },
847
  {
848
- "epoch": 1.0570824524312896,
849
- "eval_loss": 0.1192605197429657,
850
- "eval_runtime": 40.4991,
851
- "eval_samples_per_second": 83.039,
852
- "eval_steps_per_second": 10.395,
853
- "num_input_tokens_seen": 3124224,
854
  "step": 500
855
  },
856
  {
857
- "epoch": 1.0676532769556026,
858
- "grad_norm": 0.24811674654483795,
859
- "learning_rate": 3.658173752025452e-05,
860
- "loss": 0.1193,
861
- "num_input_tokens_seen": 3155584,
862
  "step": 505
863
  },
864
  {
865
- "epoch": 1.0782241014799154,
866
- "grad_norm": 0.3816189765930176,
867
- "learning_rate": 3.633225161227169e-05,
868
- "loss": 0.115,
869
- "num_input_tokens_seen": 3186944,
870
  "step": 510
871
  },
872
  {
873
- "epoch": 1.0887949260042284,
874
- "grad_norm": 0.28296881914138794,
875
- "learning_rate": 3.608133708677093e-05,
876
- "loss": 0.1146,
877
- "num_input_tokens_seen": 3218304,
878
  "step": 515
879
  },
880
  {
881
- "epoch": 1.0993657505285412,
882
- "grad_norm": 0.23222461342811584,
883
- "learning_rate": 3.5829025575673136e-05,
884
- "loss": 0.1109,
885
- "num_input_tokens_seen": 3249664,
886
  "step": 520
887
  },
888
  {
889
- "epoch": 1.109936575052854,
890
- "grad_norm": 0.2331598997116089,
891
- "learning_rate": 3.5575348887012336e-05,
892
- "loss": 0.1143,
893
- "num_input_tokens_seen": 3280960,
894
  "step": 525
895
  },
896
  {
897
- "epoch": 1.120507399577167,
898
- "grad_norm": 0.2590779662132263,
899
- "learning_rate": 3.532033900092571e-05,
900
- "loss": 0.1129,
901
- "num_input_tokens_seen": 3312320,
902
  "step": 530
903
  },
904
  {
905
- "epoch": 1.1310782241014798,
906
- "grad_norm": 0.5093595385551453,
907
- "learning_rate": 3.506402806562202e-05,
908
- "loss": 0.1139,
909
- "num_input_tokens_seen": 3343424,
910
  "step": 535
911
  },
912
  {
913
- "epoch": 1.1416490486257929,
914
- "grad_norm": 0.41402578353881836,
915
- "learning_rate": 3.480644839332876e-05,
916
- "loss": 0.1122,
917
- "num_input_tokens_seen": 3374720,
918
  "step": 540
919
  },
920
  {
921
- "epoch": 1.1522198731501057,
922
- "grad_norm": 0.2018992006778717,
923
- "learning_rate": 3.454763245621871e-05,
924
- "loss": 0.111,
925
- "num_input_tokens_seen": 3406016,
926
  "step": 545
927
  },
928
  {
929
- "epoch": 1.1627906976744187,
930
- "grad_norm": 0.7119062542915344,
931
- "learning_rate": 3.428761288231621e-05,
932
- "loss": 0.1105,
933
- "num_input_tokens_seen": 3437184,
934
  "step": 550
935
  },
936
  {
937
- "epoch": 1.1733615221987315,
938
- "grad_norm": 0.1787111908197403,
939
- "learning_rate": 3.402642245138394e-05,
940
- "loss": 0.1128,
941
- "num_input_tokens_seen": 3468416,
942
  "step": 555
943
  },
944
  {
945
- "epoch": 1.1839323467230445,
946
- "grad_norm": 0.3644562065601349,
947
- "learning_rate": 3.376409409079043e-05,
948
- "loss": 0.1066,
949
- "num_input_tokens_seen": 3499456,
950
  "step": 560
951
  },
952
  {
953
- "epoch": 1.1945031712473573,
954
- "grad_norm": 0.18238377571105957,
955
- "learning_rate": 3.350066087135903e-05,
956
- "loss": 0.1126,
957
- "num_input_tokens_seen": 3530944,
958
  "step": 565
959
  },
960
  {
961
- "epoch": 1.20507399577167,
962
- "grad_norm": 0.4499008357524872,
963
- "learning_rate": 3.323615600319883e-05,
964
- "loss": 0.1107,
965
- "num_input_tokens_seen": 3562368,
966
  "step": 570
967
  },
968
  {
969
- "epoch": 1.215644820295983,
970
- "grad_norm": 0.21635930240154266,
971
- "learning_rate": 3.297061283151791e-05,
972
- "loss": 0.1146,
973
- "num_input_tokens_seen": 3593600,
974
  "step": 575
975
  },
976
  {
977
- "epoch": 1.226215644820296,
978
- "grad_norm": 0.2716653645038605,
979
- "learning_rate": 3.27040648324197e-05,
980
- "loss": 0.1063,
981
- "num_input_tokens_seen": 3625152,
982
  "step": 580
983
  },
984
  {
985
- "epoch": 1.236786469344609,
986
- "grad_norm": 0.48543792963027954,
987
- "learning_rate": 3.243654560868268e-05,
988
- "loss": 0.1057,
989
- "num_input_tokens_seen": 3656192,
990
  "step": 585
991
  },
992
  {
993
- "epoch": 1.2473572938689217,
994
- "grad_norm": 0.14151746034622192,
995
- "learning_rate": 3.216808888552429e-05,
996
- "loss": 0.1024,
997
- "num_input_tokens_seen": 3687232,
998
  "step": 590
999
  },
1000
  {
1001
- "epoch": 1.2579281183932347,
1002
- "grad_norm": 0.14911863207817078,
1003
- "learning_rate": 3.189872850634922e-05,
1004
- "loss": 0.1006,
1005
- "num_input_tokens_seen": 3718592,
1006
  "step": 595
1007
  },
1008
  {
1009
- "epoch": 1.2684989429175475,
1010
- "grad_norm": 0.2520624101161957,
1011
- "learning_rate": 3.162849842848294e-05,
1012
- "loss": 0.1059,
1013
- "num_input_tokens_seen": 3750336,
1014
  "step": 600
1015
  },
1016
  {
1017
- "epoch": 1.2684989429175475,
1018
- "eval_loss": 0.10877919942140579,
1019
- "eval_runtime": 40.4719,
1020
- "eval_samples_per_second": 83.095,
1021
- "eval_steps_per_second": 10.402,
1022
- "num_input_tokens_seen": 3750336,
1023
  "step": 600
1024
  },
1025
  {
1026
- "epoch": 1.2790697674418605,
1027
- "grad_norm": 0.17694608867168427,
1028
- "learning_rate": 3.1357432718890815e-05,
1029
- "loss": 0.1079,
1030
- "num_input_tokens_seen": 3781632,
1031
  "step": 605
1032
  },
1033
  {
1034
- "epoch": 1.2896405919661733,
1035
- "grad_norm": 0.2516515254974365,
1036
- "learning_rate": 3.108556554988338e-05,
1037
- "loss": 0.1106,
1038
- "num_input_tokens_seen": 3812928,
1039
  "step": 610
1040
  },
1041
  {
1042
- "epoch": 1.3002114164904863,
1043
- "grad_norm": 0.19276951253414154,
1044
- "learning_rate": 3.081293119480838e-05,
1045
- "loss": 0.1027,
1046
- "num_input_tokens_seen": 3843904,
1047
  "step": 615
1048
  },
1049
  {
1050
- "epoch": 1.3107822410147991,
1051
- "grad_norm": 0.21422848105430603,
1052
- "learning_rate": 3.053956402373004e-05,
1053
- "loss": 0.1015,
1054
- "num_input_tokens_seen": 3875008,
1055
  "step": 620
1056
  },
1057
  {
1058
- "epoch": 1.3213530655391121,
1059
- "grad_norm": 0.3366522789001465,
1060
- "learning_rate": 3.0265498499096127e-05,
1061
- "loss": 0.0965,
1062
- "num_input_tokens_seen": 3906560,
1063
  "step": 625
1064
  },
1065
  {
1066
- "epoch": 1.331923890063425,
1067
- "grad_norm": 0.21864481270313263,
1068
- "learning_rate": 2.9990769171393423e-05,
1069
- "loss": 0.1106,
1070
- "num_input_tokens_seen": 3937856,
1071
  "step": 630
1072
  },
1073
  {
1074
- "epoch": 1.3424947145877377,
1075
- "grad_norm": 0.15011939406394958,
1076
- "learning_rate": 2.971541067479207e-05,
1077
- "loss": 0.0996,
1078
- "num_input_tokens_seen": 3968832,
1079
  "step": 635
1080
  },
1081
  {
1082
- "epoch": 1.3530655391120507,
1083
- "grad_norm": 0.5444221496582031,
1084
- "learning_rate": 2.9439457722779317e-05,
1085
- "loss": 0.1049,
1086
- "num_input_tokens_seen": 4000000,
1087
  "step": 640
1088
  },
1089
  {
1090
- "epoch": 1.3636363636363638,
1091
- "grad_norm": 0.2850906252861023,
1092
- "learning_rate": 2.916294510378335e-05,
1093
- "loss": 0.1118,
1094
- "num_input_tokens_seen": 4031424,
1095
  "step": 645
1096
  },
1097
  {
1098
- "epoch": 1.3742071881606766,
1099
- "grad_norm": 0.13976424932479858,
1100
- "learning_rate": 2.8885907676787622e-05,
1101
- "loss": 0.0967,
1102
- "num_input_tokens_seen": 4062720,
1103
  "step": 650
1104
  },
1105
  {
1106
- "epoch": 1.3847780126849893,
1107
- "grad_norm": 0.3354976773262024,
1108
- "learning_rate": 2.8608380366936293e-05,
1109
- "loss": 0.1035,
1110
- "num_input_tokens_seen": 4093824,
1111
  "step": 655
1112
  },
1113
  {
1114
- "epoch": 1.3953488372093024,
1115
- "grad_norm": 0.43213343620300293,
1116
- "learning_rate": 2.8330398161131376e-05,
1117
- "loss": 0.1043,
1118
- "num_input_tokens_seen": 4125120,
1119
- "step": 660
1120
- },
1121
- {
1122
- "epoch": 1.4059196617336152,
1123
- "grad_norm": 0.15570229291915894,
1124
- "learning_rate": 2.8051996103622003e-05,
1125
- "loss": 0.1045,
1126
- "num_input_tokens_seen": 4156544,
1127
- "step": 665
1128
- },
1129
- {
1130
- "epoch": 1.4164904862579282,
1131
- "grad_norm": 0.2985534965991974,
1132
- "learning_rate": 2.7773209291586567e-05,
1133
- "loss": 0.1015,
1134
- "num_input_tokens_seen": 4187904,
1135
- "step": 670
1136
- },
1137
- {
1138
- "epoch": 1.427061310782241,
1139
- "grad_norm": 1.0605559349060059,
1140
- "learning_rate": 2.749407287070812e-05,
1141
- "loss": 0.1055,
1142
- "num_input_tokens_seen": 4219072,
1143
- "step": 675
1144
- },
1145
- {
1146
- "epoch": 1.437632135306554,
1147
- "grad_norm": 0.3407301902770996,
1148
- "learning_rate": 2.7214622030743693e-05,
1149
- "loss": 0.1045,
1150
- "num_input_tokens_seen": 4250624,
1151
- "step": 680
1152
- },
1153
- {
1154
- "epoch": 1.4482029598308668,
1155
- "grad_norm": 0.4994814395904541,
1156
- "learning_rate": 2.693489200108802e-05,
1157
- "loss": 0.1035,
1158
- "num_input_tokens_seen": 4281920,
1159
- "step": 685
1160
- },
1161
- {
1162
- "epoch": 1.4587737843551798,
1163
- "grad_norm": 0.2948305606842041,
1164
- "learning_rate": 2.6654918046332323e-05,
1165
- "loss": 0.1035,
1166
- "num_input_tokens_seen": 4313024,
1167
- "step": 690
1168
- },
1169
- {
1170
- "epoch": 1.4693446088794926,
1171
- "grad_norm": 0.24761343002319336,
1172
- "learning_rate": 2.63747354618186e-05,
1173
- "loss": 0.0989,
1174
- "num_input_tokens_seen": 4344384,
1175
- "step": 695
1176
- },
1177
- {
1178
- "epoch": 1.4799154334038054,
1179
- "grad_norm": 0.1787084937095642,
1180
- "learning_rate": 2.6094379569190082e-05,
1181
- "loss": 0.096,
1182
- "num_input_tokens_seen": 4375808,
1183
- "step": 700
1184
- },
1185
- {
1186
- "epoch": 1.4799154334038054,
1187
- "eval_loss": 0.10834133625030518,
1188
- "eval_runtime": 40.5125,
1189
- "eval_samples_per_second": 83.012,
1190
- "eval_steps_per_second": 10.392,
1191
- "num_input_tokens_seen": 4375808,
1192
- "step": 700
1193
- },
1194
- {
1195
- "epoch": 1.4904862579281184,
1196
- "grad_norm": 0.30317065119743347,
1197
- "learning_rate": 2.5813885711938357e-05,
1198
- "loss": 0.1052,
1199
- "num_input_tokens_seen": 4406912,
1200
- "step": 705
1201
- },
1202
- {
1203
- "epoch": 1.5010570824524314,
1204
- "grad_norm": 0.4754318594932556,
1205
- "learning_rate": 2.553328925094773e-05,
1206
- "loss": 0.1082,
1207
- "num_input_tokens_seen": 4437952,
1208
- "step": 710
1209
- },
1210
- {
1211
- "epoch": 1.5116279069767442,
1212
- "grad_norm": 0.28454455733299255,
1213
- "learning_rate": 2.5252625560037386e-05,
1214
- "loss": 0.1053,
1215
- "num_input_tokens_seen": 4469312,
1216
- "step": 715
1217
- },
1218
- {
1219
- "epoch": 1.522198731501057,
1220
- "grad_norm": 0.20031358301639557,
1221
- "learning_rate": 2.4971930021501965e-05,
1222
- "loss": 0.1003,
1223
- "num_input_tokens_seen": 4500352,
1224
- "step": 720
1225
- },
1226
- {
1227
- "epoch": 1.53276955602537,
1228
- "grad_norm": 0.3033943176269531,
1229
- "learning_rate": 2.4691238021651042e-05,
1230
- "loss": 0.1027,
1231
- "num_input_tokens_seen": 4531584,
1232
- "step": 725
1233
- },
1234
- {
1235
- "epoch": 1.543340380549683,
1236
- "grad_norm": 0.21204060316085815,
1237
- "learning_rate": 2.4410584946348054e-05,
1238
- "loss": 0.1019,
1239
- "num_input_tokens_seen": 4562752,
1240
- "step": 730
1241
- },
1242
- {
1243
- "epoch": 1.5539112050739958,
1244
- "grad_norm": 0.21926385164260864,
1245
- "learning_rate": 2.413000617654938e-05,
1246
- "loss": 0.1094,
1247
- "num_input_tokens_seen": 4593792,
1248
- "step": 735
1249
- },
1250
- {
1251
- "epoch": 1.5644820295983086,
1252
- "grad_norm": 0.14374680817127228,
1253
- "learning_rate": 2.3849537083843936e-05,
1254
- "loss": 0.0987,
1255
- "num_input_tokens_seen": 4624832,
1256
- "step": 740
1257
- },
1258
- {
1259
- "epoch": 1.5750528541226214,
1260
- "grad_norm": 0.20950725674629211,
1261
- "learning_rate": 2.3569213025994056e-05,
1262
- "loss": 0.0973,
1263
- "num_input_tokens_seen": 4655872,
1264
- "step": 745
1265
- },
1266
- {
1267
- "epoch": 1.5856236786469344,
1268
- "grad_norm": 0.20852594077587128,
1269
- "learning_rate": 2.3289069342478018e-05,
1270
- "loss": 0.1052,
1271
- "num_input_tokens_seen": 4686912,
1272
- "step": 750
1273
- },
1274
- {
1275
- "epoch": 1.5961945031712474,
1276
- "grad_norm": 0.24457433819770813,
1277
- "learning_rate": 2.3009141350034937e-05,
1278
- "loss": 0.1069,
1279
- "num_input_tokens_seen": 4718208,
1280
- "step": 755
1281
- },
1282
- {
1283
- "epoch": 1.6067653276955602,
1284
- "grad_norm": 0.22334040701389313,
1285
- "learning_rate": 2.2729464338212515e-05,
1286
- "loss": 0.0994,
1287
- "num_input_tokens_seen": 4749376,
1288
- "step": 760
1289
- },
1290
- {
1291
- "epoch": 1.617336152219873,
1292
- "grad_norm": 0.298551082611084,
1293
- "learning_rate": 2.2450073564918185e-05,
1294
- "loss": 0.1027,
1295
- "num_input_tokens_seen": 4781120,
1296
- "step": 765
1297
- },
1298
- {
1299
- "epoch": 1.627906976744186,
1300
- "grad_norm": 0.17930828034877777,
1301
- "learning_rate": 2.21710042519743e-05,
1302
- "loss": 0.1026,
1303
- "num_input_tokens_seen": 4812480,
1304
- "step": 770
1305
- },
1306
- {
1307
- "epoch": 1.638477801268499,
1308
- "grad_norm": 0.21870951354503632,
1309
- "learning_rate": 2.1892291580677822e-05,
1310
- "loss": 0.0974,
1311
- "num_input_tokens_seen": 4843712,
1312
- "step": 775
1313
- },
1314
- {
1315
- "epoch": 1.6490486257928119,
1316
- "grad_norm": 0.31846246123313904,
1317
- "learning_rate": 2.1613970687365127e-05,
1318
- "loss": 0.1131,
1319
- "num_input_tokens_seen": 4874944,
1320
- "step": 780
1321
- },
1322
- {
1323
- "epoch": 1.6596194503171247,
1324
- "grad_norm": 0.16467052698135376,
1325
- "learning_rate": 2.1336076658982524e-05,
1326
- "loss": 0.0919,
1327
- "num_input_tokens_seen": 4906368,
1328
- "step": 785
1329
- },
1330
- {
1331
- "epoch": 1.6701902748414377,
1332
- "grad_norm": 0.21385768055915833,
1333
- "learning_rate": 2.1058644528662945e-05,
1334
- "loss": 0.1036,
1335
- "num_input_tokens_seen": 4937536,
1336
- "step": 790
1337
- },
1338
- {
1339
- "epoch": 1.6807610993657507,
1340
- "grad_norm": 0.23187273740768433,
1341
- "learning_rate": 2.0781709271309423e-05,
1342
- "loss": 0.0956,
1343
- "num_input_tokens_seen": 4968832,
1344
- "step": 795
1345
- },
1346
- {
1347
- "epoch": 1.6913319238900635,
1348
- "grad_norm": 0.1834268420934677,
1349
- "learning_rate": 2.0505305799185966e-05,
1350
- "loss": 0.0998,
1351
- "num_input_tokens_seen": 5000128,
1352
- "step": 800
1353
- },
1354
- {
1355
- "epoch": 1.6913319238900635,
1356
- "eval_loss": 0.10008509457111359,
1357
- "eval_runtime": 40.4757,
1358
- "eval_samples_per_second": 83.087,
1359
- "eval_steps_per_second": 10.401,
1360
- "num_input_tokens_seen": 5000128,
1361
- "step": 800
1362
- },
1363
- {
1364
- "epoch": 1.7019027484143763,
1365
- "grad_norm": 0.21062688529491425,
1366
- "learning_rate": 2.022946895751625e-05,
1367
- "loss": 0.0956,
1368
- "num_input_tokens_seen": 5031360,
1369
- "step": 805
1370
- },
1371
- {
1372
- "epoch": 1.712473572938689,
1373
- "grad_norm": 1.7325960397720337,
1374
- "learning_rate": 1.9954233520090843e-05,
1375
- "loss": 0.1008,
1376
- "num_input_tokens_seen": 5062720,
1377
- "step": 810
1378
- },
1379
- {
1380
- "epoch": 1.723044397463002,
1381
- "grad_norm": 0.3289014399051666,
1382
- "learning_rate": 1.967963418488335e-05,
1383
- "loss": 0.0955,
1384
- "num_input_tokens_seen": 5093888,
1385
- "step": 815
1386
- },
1387
- {
1388
- "epoch": 1.733615221987315,
1389
- "grad_norm": 0.5929372906684875,
1390
- "learning_rate": 1.9405705569676206e-05,
1391
- "loss": 0.1039,
1392
- "num_input_tokens_seen": 5125120,
1393
- "step": 820
1394
- },
1395
- {
1396
- "epoch": 1.744186046511628,
1397
- "grad_norm": 0.32440027594566345,
1398
- "learning_rate": 1.9132482207696488e-05,
1399
- "loss": 0.1005,
1400
- "num_input_tokens_seen": 5156544,
1401
- "step": 825
1402
- },
1403
- {
1404
- "epoch": 1.7547568710359407,
1405
- "grad_norm": 0.9935529828071594,
1406
- "learning_rate": 1.8859998543262474e-05,
1407
- "loss": 0.1069,
1408
- "num_input_tokens_seen": 5187776,
1409
- "step": 830
1410
- },
1411
- {
1412
- "epoch": 1.7653276955602537,
1413
- "grad_norm": 0.3179354667663574,
1414
- "learning_rate": 1.8588288927441334e-05,
1415
- "loss": 0.1004,
1416
- "num_input_tokens_seen": 5218944,
1417
- "step": 835
1418
- },
1419
- {
1420
- "epoch": 1.7758985200845667,
1421
- "grad_norm": 0.2485605925321579,
1422
- "learning_rate": 1.831738761371863e-05,
1423
- "loss": 0.1002,
1424
- "num_input_tokens_seen": 5250112,
1425
- "step": 840
1426
- },
1427
- {
1428
- "epoch": 1.7864693446088795,
1429
- "grad_norm": 0.2269657999277115,
1430
- "learning_rate": 1.8047328753680083e-05,
1431
- "loss": 0.0927,
1432
- "num_input_tokens_seen": 5281088,
1433
- "step": 845
1434
- },
1435
- {
1436
- "epoch": 1.7970401691331923,
1437
- "grad_norm": 0.2539865970611572,
1438
- "learning_rate": 1.777814639270622e-05,
1439
- "loss": 0.1013,
1440
- "num_input_tokens_seen": 5312256,
1441
- "step": 850
1442
- },
1443
- {
1444
- "epoch": 1.8076109936575053,
1445
- "grad_norm": 0.6908059120178223,
1446
- "learning_rate": 1.7509874465680377e-05,
1447
- "loss": 0.0945,
1448
- "num_input_tokens_seen": 5343744,
1449
- "step": 855
1450
- },
1451
- {
1452
- "epoch": 1.8181818181818183,
1453
- "grad_norm": 0.19062310457229614,
1454
- "learning_rate": 1.724254679271065e-05,
1455
- "loss": 0.0949,
1456
- "num_input_tokens_seen": 5374976,
1457
- "step": 860
1458
- },
1459
- {
1460
- "epoch": 1.8287526427061311,
1461
- "grad_norm": 0.2800229787826538,
1462
- "learning_rate": 1.6976197074866315e-05,
1463
- "loss": 0.0923,
1464
- "num_input_tokens_seen": 5406144,
1465
- "step": 865
1466
- },
1467
- {
1468
- "epoch": 1.839323467230444,
1469
- "grad_norm": 0.18416666984558105,
1470
- "learning_rate": 1.6710858889929255e-05,
1471
- "loss": 0.1049,
1472
- "num_input_tokens_seen": 5437760,
1473
- "step": 870
1474
- },
1475
- {
1476
- "epoch": 1.8498942917547567,
1477
- "grad_norm": 0.2170882225036621,
1478
- "learning_rate": 1.6446565688160897e-05,
1479
- "loss": 0.0906,
1480
- "num_input_tokens_seen": 5468992,
1481
- "step": 875
1482
- },
1483
- {
1484
- "epoch": 1.8604651162790697,
1485
- "grad_norm": 0.5100112557411194,
1486
- "learning_rate": 1.6183350788085317e-05,
1487
- "loss": 0.0942,
1488
- "num_input_tokens_seen": 5500288,
1489
- "step": 880
1490
- },
1491
- {
1492
- "epoch": 1.8710359408033828,
1493
- "grad_norm": 0.2084072232246399,
1494
- "learning_rate": 1.592124737228881e-05,
1495
- "loss": 0.1,
1496
- "num_input_tokens_seen": 5531456,
1497
- "step": 885
1498
- },
1499
- {
1500
- "epoch": 1.8816067653276956,
1501
- "grad_norm": 0.28143033385276794,
1502
- "learning_rate": 1.566028848323674e-05,
1503
- "loss": 0.0985,
1504
- "num_input_tokens_seen": 5562624,
1505
- "step": 890
1506
- },
1507
- {
1508
- "epoch": 1.8921775898520083,
1509
- "grad_norm": 0.5206342935562134,
1510
- "learning_rate": 1.540050701910796e-05,
1511
- "loss": 0.0959,
1512
- "num_input_tokens_seen": 5593536,
1513
- "step": 895
1514
- },
1515
- {
1516
- "epoch": 1.9027484143763214,
1517
- "grad_norm": 0.17240764200687408,
1518
- "learning_rate": 1.5141935729647461e-05,
1519
- "loss": 0.1083,
1520
- "num_input_tokens_seen": 5624576,
1521
- "step": 900
1522
- },
1523
- {
1524
- "epoch": 1.9027484143763214,
1525
- "eval_loss": 0.09912961721420288,
1526
- "eval_runtime": 40.4925,
1527
- "eval_samples_per_second": 83.052,
1528
- "eval_steps_per_second": 10.397,
1529
- "num_input_tokens_seen": 5624576,
1530
- "step": 900
1531
- },
1532
- {
1533
- "epoch": 1.9133192389006344,
1534
- "grad_norm": 0.2102658748626709,
1535
- "learning_rate": 1.4884607212037726e-05,
1536
- "loss": 0.0942,
1537
- "num_input_tokens_seen": 5655936,
1538
- "step": 905
1539
- },
1540
- {
1541
- "epoch": 1.9238900634249472,
1542
- "grad_norm": 0.18206021189689636,
1543
- "learning_rate": 1.4628553906789322e-05,
1544
- "loss": 0.1026,
1545
- "num_input_tokens_seen": 5686976,
1546
- "step": 910
1547
- },
1548
- {
1549
- "epoch": 1.93446088794926,
1550
- "grad_norm": 0.3003005385398865,
1551
- "learning_rate": 1.4373808093651215e-05,
1552
- "loss": 0.0933,
1553
- "num_input_tokens_seen": 5718592,
1554
- "step": 915
1555
- },
1556
- {
1557
- "epoch": 1.945031712473573,
1558
- "grad_norm": 0.25162649154663086,
1559
- "learning_rate": 1.4120401887541423e-05,
1560
- "loss": 0.0955,
1561
- "num_input_tokens_seen": 5749952,
1562
- "step": 920
1563
- },
1564
- {
1565
- "epoch": 1.955602536997886,
1566
- "grad_norm": 0.19604356586933136,
1567
- "learning_rate": 1.3868367234498328e-05,
1568
- "loss": 0.0933,
1569
- "num_input_tokens_seen": 5780928,
1570
- "step": 925
1571
- },
1572
- {
1573
- "epoch": 1.9661733615221988,
1574
- "grad_norm": 0.3053622543811798,
1575
- "learning_rate": 1.3617735907653434e-05,
1576
- "loss": 0.0905,
1577
- "num_input_tokens_seen": 5812032,
1578
- "step": 930
1579
- },
1580
- {
1581
- "epoch": 1.9767441860465116,
1582
- "grad_norm": 0.2663424015045166,
1583
- "learning_rate": 1.3368539503225746e-05,
1584
- "loss": 0.0959,
1585
- "num_input_tokens_seen": 5843136,
1586
- "step": 935
1587
- },
1588
- {
1589
- "epoch": 1.9873150105708244,
1590
- "grad_norm": 0.25155574083328247,
1591
- "learning_rate": 1.3120809436538656e-05,
1592
- "loss": 0.1031,
1593
- "num_input_tokens_seen": 5874752,
1594
- "step": 940
1595
- },
1596
- {
1597
- "epoch": 1.9978858350951374,
1598
- "grad_norm": 0.22895610332489014,
1599
- "learning_rate": 1.2874576938059402e-05,
1600
- "loss": 0.0896,
1601
- "num_input_tokens_seen": 5905728,
1602
- "step": 945
1603
- },
1604
- {
1605
- "epoch": 2.0084566596194504,
1606
- "grad_norm": 0.5792025327682495,
1607
- "learning_rate": 1.2629873049462032e-05,
1608
- "loss": 0.0931,
1609
- "num_input_tokens_seen": 5936448,
1610
- "step": 950
1611
- },
1612
- {
1613
- "epoch": 2.019027484143763,
1614
- "grad_norm": 0.21641181409358978,
1615
- "learning_rate": 1.2386728619714091e-05,
1616
- "loss": 0.0904,
1617
- "num_input_tokens_seen": 5967808,
1618
- "step": 955
1619
- },
1620
- {
1621
- "epoch": 2.029598308668076,
1622
- "grad_norm": 0.32977041602134705,
1623
- "learning_rate": 1.214517430118753e-05,
1624
- "loss": 0.0973,
1625
- "num_input_tokens_seen": 5998720,
1626
- "step": 960
1627
- },
1628
- {
1629
- "epoch": 2.040169133192389,
1630
- "grad_norm": 0.3212999105453491,
1631
- "learning_rate": 1.190524054579455e-05,
1632
- "loss": 0.0937,
1633
- "num_input_tokens_seen": 6030016,
1634
- "step": 965
1635
- },
1636
- {
1637
- "epoch": 2.050739957716702,
1638
- "grad_norm": 0.2424679398536682,
1639
- "learning_rate": 1.1666957601148576e-05,
1640
- "loss": 0.0898,
1641
- "num_input_tokens_seen": 6061184,
1642
- "step": 970
1643
- },
1644
- {
1645
- "epoch": 2.061310782241015,
1646
- "grad_norm": 0.39736026525497437,
1647
- "learning_rate": 1.1430355506751095e-05,
1648
- "loss": 0.1006,
1649
- "num_input_tokens_seen": 6092672,
1650
- "step": 975
1651
- },
1652
- {
1653
- "epoch": 2.0718816067653276,
1654
- "grad_norm": 0.2846342623233795,
1655
- "learning_rate": 1.119546409020461e-05,
1656
- "loss": 0.0981,
1657
- "num_input_tokens_seen": 6123712,
1658
- "step": 980
1659
- },
1660
- {
1661
- "epoch": 2.0824524312896404,
1662
- "grad_norm": 0.29333314299583435,
1663
- "learning_rate": 1.0962312963452467e-05,
1664
- "loss": 0.0943,
1665
- "num_input_tokens_seen": 6154816,
1666
- "step": 985
1667
- },
1668
- {
1669
- "epoch": 2.0930232558139537,
1670
- "grad_norm": 0.4092048108577728,
1671
- "learning_rate": 1.0730931519045697e-05,
1672
- "loss": 0.0943,
1673
- "num_input_tokens_seen": 6186176,
1674
- "step": 990
1675
- },
1676
- {
1677
- "epoch": 2.1035940803382664,
1678
- "grad_norm": 0.2516307532787323,
1679
- "learning_rate": 1.050134892643767e-05,
1680
- "loss": 0.0843,
1681
- "num_input_tokens_seen": 6217216,
1682
- "step": 995
1683
- },
1684
- {
1685
- "epoch": 2.1141649048625792,
1686
- "grad_norm": 0.2285660356283188,
1687
- "learning_rate": 1.0273594128306738e-05,
1688
- "loss": 0.0953,
1689
- "num_input_tokens_seen": 6248320,
1690
- "step": 1000
1691
- },
1692
- {
1693
- "epoch": 2.1141649048625792,
1694
- "eval_loss": 0.09716298431158066,
1695
- "eval_runtime": 40.4648,
1696
- "eval_samples_per_second": 83.109,
1697
- "eval_steps_per_second": 10.404,
1698
- "num_input_tokens_seen": 6248320,
1699
- "step": 1000
1700
- },
1701
- {
1702
- "epoch": 2.124735729386892,
1703
- "grad_norm": 0.21935948729515076,
1704
- "learning_rate": 1.00476958369076e-05,
1705
- "loss": 0.0923,
1706
- "num_input_tokens_seen": 6279552,
1707
- "step": 1005
1708
- },
1709
- {
1710
- "epoch": 2.1353065539112053,
1711
- "grad_norm": 0.3147173523902893,
1712
- "learning_rate": 9.82368253045158e-06,
1713
- "loss": 0.0847,
1714
- "num_input_tokens_seen": 6311296,
1715
- "step": 1010
1716
- },
1717
- {
1718
- "epoch": 2.145877378435518,
1719
- "grad_norm": 0.208901509642601,
1720
- "learning_rate": 9.601582449516538e-06,
1721
- "loss": 0.0921,
1722
- "num_input_tokens_seen": 6342656,
1723
- "step": 1015
1724
- },
1725
- {
1726
- "epoch": 2.156448202959831,
1727
- "grad_norm": 0.24753566086292267,
1728
- "learning_rate": 9.381423593486629e-06,
1729
- "loss": 0.0887,
1730
- "num_input_tokens_seen": 6374208,
1731
- "step": 1020
1732
- },
1733
- {
1734
- "epoch": 2.1670190274841437,
1735
- "grad_norm": 0.23306626081466675,
1736
- "learning_rate": 9.163233717022568e-06,
1737
- "loss": 0.0924,
1738
- "num_input_tokens_seen": 6405440,
1739
- "step": 1025
1740
- },
1741
- {
1742
- "epoch": 2.177589852008457,
1743
- "grad_norm": 0.22320829331874847,
1744
- "learning_rate": 8.947040326562638e-06,
1745
- "loss": 0.0884,
1746
- "num_input_tokens_seen": 6436928,
1747
- "step": 1030
1748
- },
1749
- {
1750
- "epoch": 2.1881606765327697,
1751
- "grad_norm": 0.19100725650787354,
1752
- "learning_rate": 8.732870676855096e-06,
1753
- "loss": 0.0937,
1754
- "num_input_tokens_seen": 6468288,
1755
- "step": 1035
1756
- },
1757
- {
1758
- "epoch": 2.1987315010570825,
1759
- "grad_norm": 0.17379307746887207,
1760
- "learning_rate": 8.520751767522257e-06,
1761
- "loss": 0.0856,
1762
- "num_input_tokens_seen": 6499584,
1763
- "step": 1040
1764
- },
1765
- {
1766
- "epoch": 2.2093023255813953,
1767
- "grad_norm": 0.19016264379024506,
1768
- "learning_rate": 8.310710339656707e-06,
1769
- "loss": 0.0864,
1770
- "num_input_tokens_seen": 6530752,
1771
- "step": 1045
1772
- },
1773
- {
1774
- "epoch": 2.219873150105708,
1775
- "grad_norm": 0.23884597420692444,
1776
- "learning_rate": 8.102772872450209e-06,
1777
- "loss": 0.0974,
1778
- "num_input_tokens_seen": 6561856,
1779
- "step": 1050
1780
- },
1781
- {
1782
- "epoch": 2.2304439746300213,
1783
- "grad_norm": 0.23964087665081024,
1784
- "learning_rate": 7.896965579855486e-06,
1785
- "loss": 0.0962,
1786
- "num_input_tokens_seen": 6592960,
1787
- "step": 1055
1788
- },
1789
- {
1790
- "epoch": 2.241014799154334,
1791
- "grad_norm": 0.38224127888679504,
1792
- "learning_rate": 7.693314407281615e-06,
1793
- "loss": 0.0993,
1794
- "num_input_tokens_seen": 6624256,
1795
- "step": 1060
1796
- },
1797
- {
1798
- "epoch": 2.251585623678647,
1799
- "grad_norm": 0.2022206038236618,
1800
- "learning_rate": 7.49184502832308e-06,
1801
- "loss": 0.0915,
1802
- "num_input_tokens_seen": 6655424,
1803
- "step": 1065
1804
- },
1805
- {
1806
- "epoch": 2.2621564482029597,
1807
- "grad_norm": 0.1900220513343811,
1808
- "learning_rate": 7.292582841523268e-06,
1809
- "loss": 0.0944,
1810
- "num_input_tokens_seen": 6686400,
1811
- "step": 1070
1812
- },
1813
- {
1814
- "epoch": 2.2727272727272725,
1815
- "grad_norm": 0.23861418664455414,
1816
- "learning_rate": 7.095552967172503e-06,
1817
- "loss": 0.0945,
1818
- "num_input_tokens_seen": 6717376,
1819
- "step": 1075
1820
- },
1821
- {
1822
- "epoch": 2.2832980972515857,
1823
- "grad_norm": 0.18786799907684326,
1824
- "learning_rate": 6.900780244141286e-06,
1825
- "loss": 0.0896,
1826
- "num_input_tokens_seen": 6748608,
1827
- "step": 1080
1828
- },
1829
- {
1830
- "epoch": 2.2938689217758985,
1831
- "grad_norm": 0.29745545983314514,
1832
- "learning_rate": 6.708289226748868e-06,
1833
- "loss": 0.0958,
1834
- "num_input_tokens_seen": 6779776,
1835
- "step": 1085
1836
- },
1837
- {
1838
- "epoch": 2.3044397463002113,
1839
- "grad_norm": 0.23612141609191895,
1840
- "learning_rate": 6.518104181667844e-06,
1841
- "loss": 0.0938,
1842
- "num_input_tokens_seen": 6810880,
1843
- "step": 1090
1844
- },
1845
- {
1846
- "epoch": 2.3150105708245245,
1847
- "grad_norm": 0.20987972617149353,
1848
- "learning_rate": 6.3302490848648864e-06,
1849
- "loss": 0.0923,
1850
- "num_input_tokens_seen": 6842112,
1851
- "step": 1095
1852
- },
1853
- {
1854
- "epoch": 2.3255813953488373,
1855
- "grad_norm": 0.22207896411418915,
1856
- "learning_rate": 6.144747618578209e-06,
1857
- "loss": 0.0887,
1858
- "num_input_tokens_seen": 6873152,
1859
- "step": 1100
1860
- },
1861
- {
1862
- "epoch": 2.3255813953488373,
1863
- "eval_loss": 0.09644165635108948,
1864
- "eval_runtime": 40.523,
1865
- "eval_samples_per_second": 82.99,
1866
- "eval_steps_per_second": 10.389,
1867
- "num_input_tokens_seen": 6873152,
1868
- "step": 1100
1869
- },
1870
- {
1871
- "epoch": 2.33615221987315,
1872
- "grad_norm": 0.37628617882728577,
1873
- "learning_rate": 5.961623168332006e-06,
1874
- "loss": 0.0826,
1875
- "num_input_tokens_seen": 6904512,
1876
- "step": 1105
1877
- },
1878
- {
1879
- "epoch": 2.346723044397463,
1880
- "grad_norm": 0.29637783765792847,
1881
- "learning_rate": 5.780898819988354e-06,
1882
- "loss": 0.0826,
1883
- "num_input_tokens_seen": 6936064,
1884
- "step": 1110
1885
- },
1886
- {
1887
- "epoch": 2.3572938689217757,
1888
- "grad_norm": 0.22360184788703918,
1889
- "learning_rate": 5.602597356836803e-06,
1890
- "loss": 0.0929,
1891
- "num_input_tokens_seen": 6967424,
1892
- "step": 1115
1893
- },
1894
- {
1895
- "epoch": 2.367864693446089,
1896
- "grad_norm": 0.20639285445213318,
1897
- "learning_rate": 5.426741256722239e-06,
1898
- "loss": 0.0936,
1899
- "num_input_tokens_seen": 6998592,
1900
- "step": 1120
1901
- },
1902
- {
1903
- "epoch": 2.3784355179704018,
1904
- "grad_norm": 0.25867342948913574,
1905
- "learning_rate": 5.253352689211114e-06,
1906
- "loss": 0.0856,
1907
- "num_input_tokens_seen": 7029952,
1908
- "step": 1125
1909
- },
1910
- {
1911
- "epoch": 2.3890063424947146,
1912
- "grad_norm": 0.2777279019355774,
1913
- "learning_rate": 5.082453512796634e-06,
1914
- "loss": 0.0923,
1915
- "num_input_tokens_seen": 7060992,
1916
- "step": 1130
1917
- },
1918
- {
1919
- "epoch": 2.3995771670190273,
1920
- "grad_norm": 0.31583741307258606,
1921
- "learning_rate": 4.914065272143153e-06,
1922
- "loss": 0.0911,
1923
- "num_input_tokens_seen": 7092224,
1924
- "step": 1135
1925
- },
1926
- {
1927
- "epoch": 2.41014799154334,
1928
- "grad_norm": 0.3207012116909027,
1929
- "learning_rate": 4.7482091953700705e-06,
1930
- "loss": 0.0851,
1931
- "num_input_tokens_seen": 7123776,
1932
- "step": 1140
1933
- },
1934
- {
1935
- "epoch": 2.4207188160676534,
1936
- "grad_norm": 0.19293835759162903,
1937
- "learning_rate": 4.584906191375715e-06,
1938
- "loss": 0.0956,
1939
- "num_input_tokens_seen": 7155072,
1940
- "step": 1145
1941
- },
1942
- {
1943
- "epoch": 2.431289640591966,
1944
- "grad_norm": 0.19416087865829468,
1945
- "learning_rate": 4.424176847201411e-06,
1946
- "loss": 0.0916,
1947
- "num_input_tokens_seen": 7186240,
1948
- "step": 1150
1949
- },
1950
- {
1951
- "epoch": 2.441860465116279,
1952
- "grad_norm": 0.2779330313205719,
1953
- "learning_rate": 4.266041425436151e-06,
1954
- "loss": 0.0886,
1955
- "num_input_tokens_seen": 7217536,
1956
- "step": 1155
1957
- },
1958
- {
1959
- "epoch": 2.452431289640592,
1960
- "grad_norm": 0.19005738198757172,
1961
- "learning_rate": 4.110519861662143e-06,
1962
- "loss": 0.0852,
1963
- "num_input_tokens_seen": 7248576,
1964
- "step": 1160
1965
- },
1966
- {
1967
- "epoch": 2.463002114164905,
1968
- "grad_norm": 0.2309303879737854,
1969
- "learning_rate": 3.957631761941641e-06,
1970
- "loss": 0.0942,
1971
- "num_input_tokens_seen": 7279808,
1972
- "step": 1165
1973
- },
1974
- {
1975
- "epoch": 2.473572938689218,
1976
- "grad_norm": 0.18085496127605438,
1977
- "learning_rate": 3.807396400345223e-06,
1978
- "loss": 0.0889,
1979
- "num_input_tokens_seen": 7311168,
1980
- "step": 1170
1981
- },
1982
- {
1983
- "epoch": 2.4841437632135306,
1984
- "grad_norm": 0.2057885229587555,
1985
- "learning_rate": 3.6598327165220296e-06,
1986
- "loss": 0.0907,
1987
- "num_input_tokens_seen": 7342528,
1988
- "step": 1175
1989
- },
1990
- {
1991
- "epoch": 2.4947145877378434,
1992
- "grad_norm": 0.18742726743221283,
1993
- "learning_rate": 3.514959313312061e-06,
1994
- "loss": 0.091,
1995
- "num_input_tokens_seen": 7373696,
1996
- "step": 1180
1997
- },
1998
- {
1999
- "epoch": 2.5052854122621566,
2000
- "grad_norm": 0.1891215294599533,
2001
- "learning_rate": 3.372794454401032e-06,
2002
- "loss": 0.0888,
2003
- "num_input_tokens_seen": 7404928,
2004
- "step": 1185
2005
- },
2006
- {
2007
- "epoch": 2.5158562367864694,
2008
- "grad_norm": 0.42460882663726807,
2009
- "learning_rate": 3.2333560620178727e-06,
2010
- "loss": 0.0965,
2011
- "num_input_tokens_seen": 7436096,
2012
- "step": 1190
2013
- },
2014
- {
2015
- "epoch": 2.526427061310782,
2016
- "grad_norm": 0.1930340677499771,
2017
- "learning_rate": 3.096661714675397e-06,
2018
- "loss": 0.0879,
2019
- "num_input_tokens_seen": 7467328,
2020
- "step": 1195
2021
- },
2022
- {
2023
- "epoch": 2.536997885835095,
2024
- "grad_norm": 0.18262043595314026,
2025
- "learning_rate": 2.962728644954191e-06,
2026
- "loss": 0.0889,
2027
- "num_input_tokens_seen": 7498688,
2028
- "step": 1200
2029
- },
2030
- {
2031
- "epoch": 2.536997885835095,
2032
- "eval_loss": 0.09538523107767105,
2033
- "eval_runtime": 40.4494,
2034
- "eval_samples_per_second": 83.141,
2035
- "eval_steps_per_second": 10.408,
2036
- "num_input_tokens_seen": 7498688,
2037
- "step": 1200
2038
- },
2039
- {
2040
- "epoch": 2.547568710359408,
2041
- "grad_norm": 0.18525810539722443,
2042
- "learning_rate": 2.8315737373301955e-06,
2043
- "loss": 0.089,
2044
- "num_input_tokens_seen": 7529792,
2045
- "step": 1205
2046
- },
2047
- {
2048
- "epoch": 2.558139534883721,
2049
- "grad_norm": 0.20218130946159363,
2050
- "learning_rate": 2.703213526046108e-06,
2051
- "loss": 0.0965,
2052
- "num_input_tokens_seen": 7561088,
2053
- "step": 1210
2054
- },
2055
- {
2056
- "epoch": 2.568710359408034,
2057
- "grad_norm": 0.2872017025947571,
2058
- "learning_rate": 2.577664193027013e-06,
2059
- "loss": 0.0921,
2060
- "num_input_tokens_seen": 7592448,
2061
- "step": 1215
2062
- },
2063
- {
2064
- "epoch": 2.5792811839323466,
2065
- "grad_norm": 0.19029676914215088,
2066
- "learning_rate": 2.45494156584033e-06,
2067
- "loss": 0.0831,
2068
- "num_input_tokens_seen": 7624000,
2069
- "step": 1220
2070
- },
2071
- {
2072
- "epoch": 2.58985200845666,
2073
- "grad_norm": 0.22011052072048187,
2074
- "learning_rate": 2.3350611157005182e-06,
2075
- "loss": 0.0915,
2076
- "num_input_tokens_seen": 7655232,
2077
- "step": 1225
2078
- },
2079
- {
2080
- "epoch": 2.6004228329809727,
2081
- "grad_norm": 0.26502084732055664,
2082
- "learning_rate": 2.2180379555186844e-06,
2083
- "loss": 0.0893,
2084
- "num_input_tokens_seen": 7686464,
2085
- "step": 1230
2086
- },
2087
- {
2088
- "epoch": 2.6109936575052854,
2089
- "grad_norm": 0.21893960237503052,
2090
- "learning_rate": 2.103886837997307e-06,
2091
- "loss": 0.0944,
2092
- "num_input_tokens_seen": 7717824,
2093
- "step": 1235
2094
- },
2095
- {
2096
- "epoch": 2.6215644820295982,
2097
- "grad_norm": 0.2057981640100479,
2098
- "learning_rate": 1.9926221537704794e-06,
2099
- "loss": 0.0854,
2100
- "num_input_tokens_seen": 7749120,
2101
- "step": 1240
2102
- },
2103
- {
2104
- "epoch": 2.632135306553911,
2105
- "grad_norm": 0.17995457351207733,
2106
- "learning_rate": 1.884257929589664e-06,
2107
- "loss": 0.0895,
2108
- "num_input_tokens_seen": 7780736,
2109
- "step": 1245
2110
- },
2111
- {
2112
- "epoch": 2.6427061310782243,
2113
- "grad_norm": 0.22111766040325165,
2114
- "learning_rate": 1.7788078265554398e-06,
2115
- "loss": 0.0807,
2116
- "num_input_tokens_seen": 7812288,
2117
- "step": 1250
2118
- },
2119
- {
2120
- "epoch": 2.653276955602537,
2121
- "grad_norm": 0.1810263991355896,
2122
- "learning_rate": 1.6762851383952616e-06,
2123
- "loss": 0.082,
2124
- "num_input_tokens_seen": 7843392,
2125
- "step": 1255
2126
- },
2127
- {
2128
- "epoch": 2.66384778012685,
2129
- "grad_norm": 0.21223782002925873,
2130
- "learning_rate": 1.5767027897875957e-06,
2131
- "loss": 0.0897,
2132
- "num_input_tokens_seen": 7874560,
2133
- "step": 1260
2134
- },
2135
- {
2136
- "epoch": 2.6744186046511627,
2137
- "grad_norm": 0.20275099575519562,
2138
- "learning_rate": 1.4800733347325152e-06,
2139
- "loss": 0.0909,
2140
- "num_input_tokens_seen": 7905728,
2141
- "step": 1265
2142
- },
2143
- {
2144
- "epoch": 2.6849894291754755,
2145
- "grad_norm": 0.3024641275405884,
2146
- "learning_rate": 1.3864089549691012e-06,
2147
- "loss": 0.0984,
2148
- "num_input_tokens_seen": 7937088,
2149
- "step": 1270
2150
- },
2151
- {
2152
- "epoch": 2.6955602536997887,
2153
- "grad_norm": 0.18514348566532135,
2154
- "learning_rate": 1.2957214584396997e-06,
2155
- "loss": 0.0893,
2156
- "num_input_tokens_seen": 7968704,
2157
- "step": 1275
2158
- },
2159
- {
2160
- "epoch": 2.7061310782241015,
2161
- "grad_norm": 0.16217848658561707,
2162
- "learning_rate": 1.2080222778013573e-06,
2163
- "loss": 0.0843,
2164
- "num_input_tokens_seen": 8000064,
2165
- "step": 1280
2166
- },
2167
- {
2168
- "epoch": 2.7167019027484143,
2169
- "grad_norm": 0.19633322954177856,
2170
- "learning_rate": 1.1233224689845251e-06,
2171
- "loss": 0.0892,
2172
- "num_input_tokens_seen": 8031296,
2173
- "step": 1285
2174
- },
2175
- {
2176
- "epoch": 2.7272727272727275,
2177
- "grad_norm": 0.254277765750885,
2178
- "learning_rate": 1.041632709799306e-06,
2179
- "loss": 0.0883,
2180
- "num_input_tokens_seen": 8062208,
2181
- "step": 1290
2182
- },
2183
- {
2184
- "epoch": 2.7378435517970403,
2185
- "grad_norm": 0.23036494851112366,
2186
- "learning_rate": 9.629632985893033e-07,
2187
- "loss": 0.089,
2188
- "num_input_tokens_seen": 8093440,
2189
- "step": 1295
2190
- },
2191
- {
2192
- "epoch": 2.748414376321353,
2193
- "grad_norm": 0.23279865086078644,
2194
- "learning_rate": 8.873241529333776e-07,
2195
- "loss": 0.0859,
2196
- "num_input_tokens_seen": 8124864,
2197
- "step": 1300
2198
- },
2199
- {
2200
- "epoch": 2.748414376321353,
2201
- "eval_loss": 0.09499379247426987,
2202
- "eval_runtime": 40.5097,
2203
- "eval_samples_per_second": 83.017,
2204
- "eval_steps_per_second": 10.393,
2205
- "num_input_tokens_seen": 8124864,
2206
- "step": 1300
2207
- },
2208
- {
2209
- "epoch": 2.758985200845666,
2210
- "grad_norm": 0.22809527814388275,
2211
- "learning_rate": 8.147248083953562e-07,
2212
- "loss": 0.0937,
2213
- "num_input_tokens_seen": 8156032,
2214
- "step": 1305
2215
- },
2216
- {
2217
- "epoch": 2.7695560253699787,
2218
- "grad_norm": 0.1820860654115677,
2219
- "learning_rate": 7.451744173219116e-07,
2220
- "loss": 0.0927,
2221
- "num_input_tokens_seen": 8187456,
2222
- "step": 1310
2223
- },
2224
- {
2225
- "epoch": 2.780126849894292,
2226
- "grad_norm": 0.2634679973125458,
2227
- "learning_rate": 6.786817476887725e-07,
2228
- "loss": 0.084,
2229
- "num_input_tokens_seen": 8218880,
2230
- "step": 1315
2231
- },
2232
- {
2233
- "epoch": 2.7906976744186047,
2234
- "grad_norm": 0.20365993678569794,
2235
- "learning_rate": 6.152551819953667e-07,
2236
- "loss": 0.0862,
2237
- "num_input_tokens_seen": 8250048,
2238
- "step": 1320
2239
- },
2240
- {
2241
- "epoch": 2.8012684989429175,
2242
- "grad_norm": 0.24735113978385925,
2243
- "learning_rate": 5.549027162080666e-07,
2244
- "loss": 0.0967,
2245
- "num_input_tokens_seen": 8281408,
2246
- "step": 1325
2247
- },
2248
- {
2249
- "epoch": 2.8118393234672303,
2250
- "grad_norm": 0.21733231842517853,
2251
- "learning_rate": 4.976319587521788e-07,
2252
- "loss": 0.0878,
2253
- "num_input_tokens_seen": 8312448,
2254
- "step": 1330
2255
- },
2256
- {
2257
- "epoch": 2.822410147991543,
2258
- "grad_norm": 0.39031949639320374,
2259
- "learning_rate": 4.434501295527582e-07,
2260
- "loss": 0.0923,
2261
- "num_input_tokens_seen": 8343488,
2262
- "step": 1335
2263
- },
2264
- {
2265
- "epoch": 2.8329809725158563,
2266
- "grad_norm": 0.1717582643032074,
2267
- "learning_rate": 3.9236405912442544e-07,
2268
- "loss": 0.0887,
2269
- "num_input_tokens_seen": 8374976,
2270
- "step": 1340
2271
- },
2272
- {
2273
- "epoch": 2.843551797040169,
2274
- "grad_norm": 0.19292984902858734,
2275
- "learning_rate": 3.44380187710272e-07,
2276
- "loss": 0.0862,
2277
- "num_input_tokens_seen": 8406208,
2278
- "step": 1345
2279
- },
2280
- {
2281
- "epoch": 2.854122621564482,
2282
- "grad_norm": 0.19864223897457123,
2283
- "learning_rate": 2.995045644699518e-07,
2284
- "loss": 0.0862,
2285
- "num_input_tokens_seen": 8437440,
2286
- "step": 1350
2287
- },
2288
- {
2289
- "epoch": 2.864693446088795,
2290
- "grad_norm": 0.17732787132263184,
2291
- "learning_rate": 2.577428467170989e-07,
2292
- "loss": 0.0878,
2293
- "num_input_tokens_seen": 8468416,
2294
- "step": 1355
2295
- },
2296
- {
2297
- "epoch": 2.875264270613108,
2298
- "grad_norm": 0.1831037551164627,
2299
- "learning_rate": 2.1910029920610974e-07,
2300
- "loss": 0.0881,
2301
- "num_input_tokens_seen": 8500032,
2302
- "step": 1360
2303
- },
2304
- {
2305
- "epoch": 2.8858350951374208,
2306
- "grad_norm": 0.16692957282066345,
2307
- "learning_rate": 1.8358179346845694e-07,
2308
- "loss": 0.0913,
2309
- "num_input_tokens_seen": 8531200,
2310
- "step": 1365
2311
- },
2312
- {
2313
- "epoch": 2.8964059196617336,
2314
- "grad_norm": 0.19147560000419617,
2315
- "learning_rate": 1.51191807198528e-07,
2316
- "loss": 0.0899,
2317
- "num_input_tokens_seen": 8562240,
2318
- "step": 1370
2319
- },
2320
- {
2321
- "epoch": 2.9069767441860463,
2322
- "grad_norm": 0.1842157244682312,
2323
- "learning_rate": 1.2193442368915732e-07,
2324
- "loss": 0.0813,
2325
- "num_input_tokens_seen": 8593600,
2326
- "step": 1375
2327
- },
2328
- {
2329
- "epoch": 2.9175475687103596,
2330
- "grad_norm": 0.18177741765975952,
2331
- "learning_rate": 9.581333131685467e-08,
2332
- "loss": 0.0874,
2333
- "num_input_tokens_seen": 8624768,
2334
- "step": 1380
2335
- },
2336
- {
2337
- "epoch": 2.9281183932346724,
2338
- "grad_norm": 0.2615036070346832,
2339
- "learning_rate": 7.283182307681324e-08,
2340
- "loss": 0.0915,
2341
- "num_input_tokens_seen": 8655808,
2342
- "step": 1385
2343
- },
2344
- {
2345
- "epoch": 2.938689217758985,
2346
- "grad_norm": 0.30790311098098755,
2347
- "learning_rate": 5.299279616779174e-08,
2348
- "loss": 0.0835,
2349
- "num_input_tokens_seen": 8687232,
2350
- "step": 1390
2351
- },
2352
- {
2353
- "epoch": 2.949260042283298,
2354
- "grad_norm": 0.24962230026721954,
2355
- "learning_rate": 3.629875162686203e-08,
2356
- "loss": 0.092,
2357
- "num_input_tokens_seen": 8718592,
2358
- "step": 1395
2359
- },
2360
- {
2361
- "epoch": 2.9598308668076108,
2362
- "grad_norm": 0.2310824692249298,
2363
- "learning_rate": 2.2751794014111428e-08,
2364
- "loss": 0.0883,
2365
- "num_input_tokens_seen": 8749760,
2366
- "step": 1400
2367
- },
2368
- {
2369
- "epoch": 2.9598308668076108,
2370
- "eval_loss": 0.09467408061027527,
2371
- "eval_runtime": 40.4856,
2372
- "eval_samples_per_second": 83.067,
2373
- "eval_steps_per_second": 10.399,
2374
- "num_input_tokens_seen": 8749760,
2375
- "step": 1400
2376
- },
2377
- {
2378
- "epoch": 2.970401691331924,
2379
- "grad_norm": 0.21396443247795105,
2380
- "learning_rate": 1.2353631147335454e-08,
2381
- "loss": 0.0872,
2382
- "num_input_tokens_seen": 8780992,
2383
- "step": 1405
2384
- },
2385
- {
2386
- "epoch": 2.980972515856237,
2387
- "grad_norm": 0.16851051151752472,
2388
- "learning_rate": 5.105573886735049e-09,
2389
- "loss": 0.0822,
2390
- "num_input_tokens_seen": 8812224,
2391
- "step": 1410
2392
- },
2393
- {
2394
- "epoch": 2.9915433403805496,
2395
- "grad_norm": 0.22018083930015564,
2396
- "learning_rate": 1.0085359696654362e-09,
2397
- "loss": 0.0901,
2398
- "num_input_tokens_seen": 8843200,
2399
- "step": 1415
2400
- },
2401
- {
2402
- "epoch": 3.0,
2403
- "num_input_tokens_seen": 8867536,
2404
- "step": 1419,
2405
- "total_flos": 3.600530754427945e+17,
2406
- "train_loss": 0.2131162985812786,
2407
- "train_runtime": 4701.6415,
2408
- "train_samples_per_second": 19.312,
2409
- "train_steps_per_second": 0.302
2410
  }
2411
  ],
2412
  "logging_steps": 5,
2413
- "max_steps": 1419,
2414
- "num_input_tokens_seen": 8867536,
2415
  "num_train_epochs": 3,
2416
  "save_steps": 100,
2417
  "stateful_callbacks": {
@@ -2426,8 +1139,8 @@
2426
  "attributes": {}
2427
  }
2428
  },
2429
- "total_flos": 3.600530754427945e+17,
2430
- "train_batch_size": 8,
2431
  "trial_name": null,
2432
  "trial_params": null
2433
  }
 
1
  {
2
+ "best_global_step": null,
3
  "best_metric": null,
4
  "best_model_checkpoint": null,
5
+ "epoch": 2.9897377423033067,
6
  "eval_steps": 100,
7
+ "global_step": 657,
8
  "is_hyper_param_search": false,
9
  "is_local_process_zero": true,
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "epoch": 0.02280501710376283,
14
+ "grad_norm": 110.79779815673828,
15
  "learning_rate": 1.25e-05,
16
+ "loss": 8.7284,
17
+ "num_input_tokens_seen": 62080,
18
  "step": 5
19
  },
20
  {
21
+ "epoch": 0.04561003420752566,
22
+ "grad_norm": 73.81541442871094,
23
  "learning_rate": 2.5e-05,
24
+ "loss": 4.9749,
25
+ "num_input_tokens_seen": 124672,
26
  "step": 10
27
  },
28
  {
29
+ "epoch": 0.06841505131128849,
30
+ "grad_norm": 9.868797302246094,
31
  "learning_rate": 3.7500000000000003e-05,
32
+ "loss": 1.5517,
33
+ "num_input_tokens_seen": 185344,
34
  "step": 15
35
  },
36
  {
37
+ "epoch": 0.09122006841505131,
38
+ "grad_norm": 4.776010036468506,
39
  "learning_rate": 5e-05,
40
+ "loss": 0.6744,
41
+ "num_input_tokens_seen": 246912,
42
  "step": 20
43
  },
44
  {
45
+ "epoch": 0.11402508551881414,
46
+ "grad_norm": 20.511587142944336,
47
+ "learning_rate": 4.9992399382187524e-05,
48
+ "loss": 0.5648,
49
+ "num_input_tokens_seen": 307712,
50
  "step": 25
51
  },
52
  {
53
+ "epoch": 0.13683010262257697,
54
+ "grad_norm": 2.5995140075683594,
55
+ "learning_rate": 4.9969602150301404e-05,
56
+ "loss": 0.5605,
57
+ "num_input_tokens_seen": 369536,
58
  "step": 30
59
  },
60
  {
61
+ "epoch": 0.15963511972633979,
62
+ "grad_norm": 13.69864273071289,
63
+ "learning_rate": 4.9931622166185365e-05,
64
+ "loss": 0.5297,
65
+ "num_input_tokens_seen": 431104,
66
  "step": 35
67
  },
68
  {
69
+ "epoch": 0.18244013683010263,
70
+ "grad_norm": 2.312450408935547,
71
+ "learning_rate": 4.987848252354691e-05,
72
+ "loss": 0.5314,
73
+ "num_input_tokens_seen": 492672,
74
  "step": 40
75
  },
76
  {
77
+ "epoch": 0.20524515393386544,
78
+ "grad_norm": 2.1790366172790527,
79
+ "learning_rate": 4.981021553391519e-05,
80
+ "loss": 0.5013,
81
+ "num_input_tokens_seen": 554368,
82
  "step": 45
83
  },
84
  {
85
+ "epoch": 0.22805017103762829,
86
+ "grad_norm": 2.9656565189361572,
87
+ "learning_rate": 4.9726862706994016e-05,
88
+ "loss": 0.4944,
89
+ "num_input_tokens_seen": 615808,
90
  "step": 50
91
  },
92
  {
93
+ "epoch": 0.2508551881413911,
94
+ "grad_norm": 2.111856460571289,
95
+ "learning_rate": 4.962847472542185e-05,
96
+ "loss": 0.5071,
97
+ "num_input_tokens_seen": 678144,
98
  "step": 55
99
  },
100
  {
101
+ "epoch": 0.27366020524515394,
102
+ "grad_norm": 2.2539072036743164,
103
+ "learning_rate": 4.951511141395432e-05,
104
+ "loss": 0.5025,
105
+ "num_input_tokens_seen": 739968,
106
  "step": 60
107
  },
108
  {
109
+ "epoch": 0.29646522234891676,
110
+ "grad_norm": 1.4161525964736938,
111
+ "learning_rate": 4.9386841703087774e-05,
112
+ "loss": 0.5038,
113
+ "num_input_tokens_seen": 801792,
114
  "step": 65
115
  },
116
  {
117
+ "epoch": 0.31927023945267957,
118
+ "grad_norm": 2.357604742050171,
119
+ "learning_rate": 4.924374358714615e-05,
120
+ "loss": 0.4907,
121
+ "num_input_tokens_seen": 863744,
122
  "step": 70
123
  },
124
  {
125
+ "epoch": 0.34207525655644244,
126
+ "grad_norm": 1.2421399354934692,
127
+ "learning_rate": 4.908590407685657e-05,
128
+ "loss": 0.4774,
129
+ "num_input_tokens_seen": 924160,
130
  "step": 75
131
  },
132
  {
133
+ "epoch": 0.36488027366020526,
134
+ "grad_norm": 1.9083991050720215,
135
+ "learning_rate": 4.891341914644251e-05,
136
+ "loss": 0.4709,
137
+ "num_input_tokens_seen": 985856,
138
  "step": 80
139
  },
140
  {
141
+ "epoch": 0.38768529076396807,
142
+ "grad_norm": 1.5763726234436035,
143
+ "learning_rate": 4.8726393675266716e-05,
144
+ "loss": 0.4793,
145
+ "num_input_tokens_seen": 1047680,
146
  "step": 85
147
  },
148
  {
149
+ "epoch": 0.4104903078677309,
150
+ "grad_norm": 2.0817372798919678,
151
+ "learning_rate": 4.8524941384059415e-05,
152
+ "loss": 0.4835,
153
+ "num_input_tokens_seen": 1108352,
154
  "step": 90
155
  },
156
  {
157
+ "epoch": 0.43329532497149376,
158
+ "grad_norm": 1.4364395141601562,
159
+ "learning_rate": 4.830918476577042e-05,
160
+ "loss": 0.4668,
161
+ "num_input_tokens_seen": 1169536,
162
  "step": 95
163
  },
164
  {
165
+ "epoch": 0.45610034207525657,
166
+ "grad_norm": 1.723344087600708,
167
+ "learning_rate": 4.807925501108744e-05,
168
+ "loss": 0.46,
169
+ "num_input_tokens_seen": 1229824,
170
  "step": 100
171
  },
172
  {
173
+ "epoch": 0.45610034207525657,
174
+ "eval_loss": 0.4706360697746277,
175
+ "eval_runtime": 34.1945,
176
+ "eval_samples_per_second": 91.184,
177
+ "eval_steps_per_second": 5.703,
178
+ "num_input_tokens_seen": 1229824,
179
  "step": 100
180
  },
181
  {
182
+ "epoch": 0.4789053591790194,
183
+ "grad_norm": 1.8234614133834839,
184
+ "learning_rate": 4.7835291928665586e-05,
185
+ "loss": 0.4823,
186
+ "num_input_tokens_seen": 1291648,
187
  "step": 105
188
  },
189
  {
190
+ "epoch": 0.5017103762827823,
191
+ "grad_norm": 1.6761844158172607,
192
+ "learning_rate": 4.7577443860116856e-05,
193
+ "loss": 0.4905,
194
+ "num_input_tokens_seen": 1353216,
195
  "step": 110
196
  },
197
  {
198
+ "epoch": 0.5245153933865451,
199
+ "grad_norm": 1.2557445764541626,
200
+ "learning_rate": 4.730586758981105e-05,
201
+ "loss": 0.4759,
202
+ "num_input_tokens_seen": 1414272,
203
  "step": 115
204
  },
205
  {
206
+ "epoch": 0.5473204104903079,
207
+ "grad_norm": 1.6095463037490845,
208
+ "learning_rate": 4.7020728249543196e-05,
209
+ "loss": 0.4735,
210
+ "num_input_tokens_seen": 1476096,
211
  "step": 120
212
  },
213
  {
214
+ "epoch": 0.5701254275940707,
215
+ "grad_norm": 1.116170883178711,
216
+ "learning_rate": 4.672219921812517e-05,
217
+ "loss": 0.4662,
218
+ "num_input_tokens_seen": 1536384,
219
  "step": 125
220
  },
221
  {
222
+ "epoch": 0.5929304446978335,
223
+ "grad_norm": 1.605276107788086,
224
+ "learning_rate": 4.6410462015962866e-05,
225
+ "loss": 0.4757,
226
+ "num_input_tokens_seen": 1597952,
227
  "step": 130
228
  },
229
  {
230
+ "epoch": 0.6157354618015963,
231
+ "grad_norm": 1.1777241230010986,
232
+ "learning_rate": 4.608570619468283e-05,
233
+ "loss": 0.4614,
234
+ "num_input_tokens_seen": 1659008,
235
  "step": 135
236
  },
237
  {
238
+ "epoch": 0.6385404789053591,
239
+ "grad_norm": 1.8780622482299805,
240
+ "learning_rate": 4.574812922187544e-05,
241
+ "loss": 0.4553,
242
+ "num_input_tokens_seen": 1719680,
243
  "step": 140
244
  },
245
  {
246
+ "epoch": 0.661345496009122,
247
+ "grad_norm": 0.93132483959198,
248
+ "learning_rate": 4.539793636102491e-05,
249
+ "loss": 0.454,
250
+ "num_input_tokens_seen": 1781376,
251
  "step": 145
252
  },
253
  {
254
+ "epoch": 0.6841505131128849,
255
+ "grad_norm": 1.902815580368042,
256
+ "learning_rate": 4.503534054669892e-05,
257
+ "loss": 0.474,
258
+ "num_input_tokens_seen": 1843456,
259
  "step": 150
260
  },
261
  {
262
+ "epoch": 0.7069555302166477,
263
+ "grad_norm": 0.9736324548721313,
264
+ "learning_rate": 4.466056225507387e-05,
265
+ "loss": 0.4635,
266
+ "num_input_tokens_seen": 1904640,
267
  "step": 155
268
  },
269
  {
270
+ "epoch": 0.7297605473204105,
271
+ "grad_norm": 1.3553400039672852,
272
+ "learning_rate": 4.427382936987449e-05,
273
+ "loss": 0.4529,
274
+ "num_input_tokens_seen": 1965312,
275
  "step": 160
276
  },
277
  {
278
+ "epoch": 0.7525655644241733,
279
+ "grad_norm": 1.386780023574829,
280
+ "learning_rate": 4.3875377043809256e-05,
281
+ "loss": 0.4581,
282
+ "num_input_tokens_seen": 2026624,
283
  "step": 165
284
  },
285
  {
286
+ "epoch": 0.7753705815279361,
287
+ "grad_norm": 1.468717336654663,
288
+ "learning_rate": 4.346544755558591e-05,
289
+ "loss": 0.4422,
290
+ "num_input_tokens_seen": 2087552,
291
  "step": 170
292
  },
293
  {
294
+ "epoch": 0.798175598631699,
295
+ "grad_norm": 1.5911728143692017,
296
+ "learning_rate": 4.304429016259407e-05,
297
+ "loss": 0.4522,
298
+ "num_input_tokens_seen": 2148864,
299
  "step": 175
300
  },
301
  {
302
+ "epoch": 0.8209806157354618,
303
+ "grad_norm": 1.3582851886749268,
304
+ "learning_rate": 4.261216094934437e-05,
305
+ "loss": 0.4457,
306
+ "num_input_tokens_seen": 2210048,
307
  "step": 180
308
  },
309
  {
310
+ "epoch": 0.8437856328392246,
311
+ "grad_norm": 1.3723617792129517,
312
+ "learning_rate": 4.216932267175645e-05,
313
+ "loss": 0.4448,
314
+ "num_input_tokens_seen": 2272000,
315
  "step": 185
316
  },
317
  {
318
+ "epoch": 0.8665906499429875,
319
+ "grad_norm": 1.4028033018112183,
320
+ "learning_rate": 4.171604459739037e-05,
321
+ "loss": 0.4442,
322
+ "num_input_tokens_seen": 2333568,
323
  "step": 190
324
  },
325
  {
326
+ "epoch": 0.8893956670467503,
327
+ "grad_norm": 1.4118390083312988,
328
+ "learning_rate": 4.125260234171861e-05,
329
+ "loss": 0.4327,
330
+ "num_input_tokens_seen": 2395776,
331
  "step": 195
332
  },
333
  {
334
+ "epoch": 0.9122006841505131,
335
+ "grad_norm": 1.2794113159179688,
336
+ "learning_rate": 4.077927770053824e-05,
337
+ "loss": 0.4222,
338
+ "num_input_tokens_seen": 2457344,
339
  "step": 200
340
  },
341
  {
342
+ "epoch": 0.9122006841505131,
343
+ "eval_loss": 0.4173198640346527,
344
+ "eval_runtime": 34.1976,
345
+ "eval_samples_per_second": 91.176,
346
+ "eval_steps_per_second": 5.702,
347
+ "num_input_tokens_seen": 2457344,
348
  "step": 200
349
  },
350
  {
351
+ "epoch": 0.935005701254276,
352
+ "grad_norm": 2.4246602058410645,
353
+ "learning_rate": 4.029635847862519e-05,
354
+ "loss": 0.419,
355
+ "num_input_tokens_seen": 2518528,
356
  "step": 205
357
  },
358
  {
359
+ "epoch": 0.9578107183580388,
360
+ "grad_norm": 1.7850072383880615,
361
+ "learning_rate": 3.980413831473465e-05,
362
+ "loss": 0.4266,
363
+ "num_input_tokens_seen": 2580096,
364
  "step": 210
365
  },
366
  {
367
+ "epoch": 0.9806157354618016,
368
+ "grad_norm": 1.9324074983596802,
369
+ "learning_rate": 3.9302916503054246e-05,
370
+ "loss": 0.4234,
371
+ "num_input_tokens_seen": 2642176,
372
  "step": 215
373
  },
374
  {
375
+ "epoch": 1.0,
376
+ "grad_norm": 2.1440515518188477,
377
+ "learning_rate": 3.8792997811218366e-05,
378
+ "loss": 0.4299,
379
+ "num_input_tokens_seen": 2694512,
380
  "step": 220
381
  },
382
  {
383
+ "epoch": 1.0228050171037628,
384
+ "grad_norm": 1.5281051397323608,
385
+ "learning_rate": 3.8274692294994375e-05,
386
+ "loss": 0.4057,
387
+ "num_input_tokens_seen": 2756208,
388
  "step": 225
389
  },
390
  {
391
+ "epoch": 1.0456100342075256,
392
+ "grad_norm": 9.625020027160645,
393
+ "learning_rate": 3.77483151097534e-05,
394
+ "loss": 0.406,
395
+ "num_input_tokens_seen": 2817520,
396
  "step": 230
397
  },
398
  {
399
+ "epoch": 1.0684150513112884,
400
+ "grad_norm": 1.0093504190444946,
401
+ "learning_rate": 3.7214186318840246e-05,
402
+ "loss": 0.4095,
403
+ "num_input_tokens_seen": 2879216,
404
  "step": 235
405
  },
406
  {
407
+ "epoch": 1.0912200684150513,
408
+ "grad_norm": 1.1837741136550903,
409
+ "learning_rate": 3.66726306989591e-05,
410
+ "loss": 0.3905,
411
+ "num_input_tokens_seen": 2940528,
412
  "step": 240
413
  },
414
  {
415
+ "epoch": 1.114025085518814,
416
+ "grad_norm": 2.219708204269409,
417
+ "learning_rate": 3.612397754269325e-05,
418
+ "loss": 0.397,
419
+ "num_input_tokens_seen": 3002096,
420
  "step": 245
421
  },
422
  {
423
+ "epoch": 1.1368301026225769,
424
+ "grad_norm": 1.8098950386047363,
425
+ "learning_rate": 3.556856045827886e-05,
426
+ "loss": 0.388,
427
+ "num_input_tokens_seen": 3063408,
428
  "step": 250
429
  },
430
  {
431
+ "epoch": 1.1596351197263397,
432
+ "grad_norm": 1.5983895063400269,
433
+ "learning_rate": 3.500671716675478e-05,
434
+ "loss": 0.3891,
435
+ "num_input_tokens_seen": 3124848,
436
  "step": 255
437
  },
438
  {
439
+ "epoch": 1.1824401368301025,
440
+ "grad_norm": 3.421966791152954,
441
+ "learning_rate": 3.4438789296611324e-05,
442
+ "loss": 0.385,
443
+ "num_input_tokens_seen": 3186544,
444
  "step": 260
445
  },
446
  {
447
+ "epoch": 1.2052451539338653,
448
+ "grad_norm": 1.309157133102417,
449
+ "learning_rate": 3.386512217606339e-05,
450
+ "loss": 0.3857,
451
+ "num_input_tokens_seen": 3247856,
452
  "step": 265
453
  },
454
  {
455
+ "epoch": 1.2280501710376284,
456
+ "grad_norm": 1.6279858350753784,
457
+ "learning_rate": 3.328606462307377e-05,
458
+ "loss": 0.3923,
459
+ "num_input_tokens_seen": 3310320,
460
  "step": 270
461
  },
462
  {
463
+ "epoch": 1.2508551881413912,
464
+ "grad_norm": 1.5124549865722656,
465
+ "learning_rate": 3.2701968733254595e-05,
466
+ "loss": 0.3751,
467
+ "num_input_tokens_seen": 3371504,
468
  "step": 275
469
  },
470
  {
471
+ "epoch": 1.273660205245154,
472
+ "grad_norm": 1.6461663246154785,
473
+ "learning_rate": 3.211318966577581e-05,
474
+ "loss": 0.386,
475
+ "num_input_tokens_seen": 3434096,
476
  "step": 280
477
  },
478
  {
479
+ "epoch": 1.2964652223489168,
480
+ "grad_norm": 23.96044158935547,
481
+ "learning_rate": 3.1520085427410856e-05,
482
+ "loss": 0.3757,
483
+ "num_input_tokens_seen": 3495280,
484
  "step": 285
485
  },
486
  {
487
+ "epoch": 1.3192702394526796,
488
+ "grad_norm": 1.4814337491989136,
489
+ "learning_rate": 3.092301665485083e-05,
490
+ "loss": 0.3831,
491
+ "num_input_tokens_seen": 3557616,
492
  "step": 290
493
  },
494
  {
495
+ "epoch": 1.3420752565564424,
496
+ "grad_norm": 1.88231360912323,
497
+ "learning_rate": 3.032234639541956e-05,
498
+ "loss": 0.3702,
499
+ "num_input_tokens_seen": 3617904,
500
  "step": 295
501
  },
502
  {
503
+ "epoch": 1.3648802736602053,
504
+ "grad_norm": 2.090513229370117,
505
+ "learning_rate": 2.971843988632292e-05,
506
+ "loss": 0.382,
507
+ "num_input_tokens_seen": 3679728,
508
  "step": 300
509
  },
510
  {
511
+ "epoch": 1.3648802736602053,
512
+ "eval_loss": 0.3807390332221985,
513
+ "eval_runtime": 34.184,
514
+ "eval_samples_per_second": 91.212,
515
+ "eval_steps_per_second": 5.704,
516
+ "num_input_tokens_seen": 3679728,
517
  "step": 300
518
  },
519
  {
520
+ "epoch": 1.387685290763968,
521
+ "grad_norm": 2.231807231903076,
522
+ "learning_rate": 2.9111664332566517e-05,
523
+ "loss": 0.392,
524
+ "num_input_tokens_seen": 3740528,
525
  "step": 305
526
  },
527
  {
528
+ "epoch": 1.4104903078677309,
529
+ "grad_norm": 1.7773799896240234,
530
+ "learning_rate": 2.850238868367691e-05,
531
+ "loss": 0.3707,
532
+ "num_input_tokens_seen": 3802992,
533
  "step": 310
534
  },
535
  {
536
+ "epoch": 1.4332953249714937,
537
+ "grad_norm": 1.0488688945770264,
538
+ "learning_rate": 2.7890983409362077e-05,
539
+ "loss": 0.3645,
540
+ "num_input_tokens_seen": 3863792,
541
  "step": 315
542
  },
543
  {
544
+ "epoch": 1.4561003420752565,
545
+ "grad_norm": 1.3682293891906738,
546
+ "learning_rate": 2.7277820274247506e-05,
547
+ "loss": 0.3599,
548
+ "num_input_tokens_seen": 3925616,
549
  "step": 320
550
  },
551
  {
552
+ "epoch": 1.4789053591790193,
553
+ "grad_norm": 1.1371090412139893,
554
+ "learning_rate": 2.6663272111824916e-05,
555
+ "loss": 0.363,
556
+ "num_input_tokens_seen": 3986416,
557
  "step": 325
558
  },
559
  {
560
+ "epoch": 1.5017103762827824,
561
+ "grad_norm": 1.2293074131011963,
562
+ "learning_rate": 2.6047712597751128e-05,
563
+ "loss": 0.3542,
564
+ "num_input_tokens_seen": 4046704,
565
  "step": 330
566
  },
567
  {
568
+ "epoch": 1.5245153933865452,
569
+ "grad_norm": 1.2787760496139526,
570
+ "learning_rate": 2.5431516022634715e-05,
571
+ "loss": 0.3479,
572
+ "num_input_tokens_seen": 4107632,
573
  "step": 335
574
  },
575
  {
576
+ "epoch": 1.547320410490308,
577
+ "grad_norm": 1.2319544553756714,
578
+ "learning_rate": 2.4815057064448865e-05,
579
+ "loss": 0.3532,
580
+ "num_input_tokens_seen": 4168816,
581
  "step": 340
582
  },
583
  {
584
+ "epoch": 1.5701254275940708,
585
+ "grad_norm": 1.7431679964065552,
586
+ "learning_rate": 2.419871056070862e-05,
587
+ "loss": 0.3564,
588
+ "num_input_tokens_seen": 4230512,
589
  "step": 345
590
  },
591
  {
592
+ "epoch": 1.5929304446978336,
593
+ "grad_norm": 1.544203281402588,
594
+ "learning_rate": 2.3582851280551207e-05,
595
+ "loss": 0.3424,
596
+ "num_input_tokens_seen": 4291952,
597
  "step": 350
598
  },
599
  {
600
+ "epoch": 1.6157354618015964,
601
+ "grad_norm": 1.4278355836868286,
602
+ "learning_rate": 2.2967853696857782e-05,
603
+ "loss": 0.3559,
604
+ "num_input_tokens_seen": 4354032,
605
  "step": 355
606
  },
607
  {
608
+ "epoch": 1.6385404789053593,
609
+ "grad_norm": 2.787482738494873,
610
+ "learning_rate": 2.2354091758555493e-05,
611
+ "loss": 0.3548,
612
+ "num_input_tokens_seen": 4415344,
613
  "step": 360
614
  },
615
  {
616
+ "epoch": 1.661345496009122,
617
+ "grad_norm": 1.2671185731887817,
618
+ "learning_rate": 2.1741938663238026e-05,
619
+ "loss": 0.3467,
620
+ "num_input_tokens_seen": 4477040,
621
  "step": 365
622
  },
623
  {
624
+ "epoch": 1.6841505131128849,
625
+ "grad_norm": 1.43488609790802,
626
+ "learning_rate": 2.1131766630242966e-05,
627
+ "loss": 0.3515,
628
+ "num_input_tokens_seen": 4539120,
629
  "step": 370
630
  },
631
  {
632
+ "epoch": 1.7069555302166477,
633
+ "grad_norm": 1.317392349243164,
634
+ "learning_rate": 2.0523946674324157e-05,
635
+ "loss": 0.3308,
636
+ "num_input_tokens_seen": 4600304,
637
  "step": 375
638
  },
639
  {
640
+ "epoch": 1.7297605473204105,
641
+ "grad_norm": 1.1536561250686646,
642
+ "learning_rate": 1.991884838005628e-05,
643
+ "loss": 0.3354,
644
+ "num_input_tokens_seen": 4661872,
645
  "step": 380
646
  },
647
  {
648
+ "epoch": 1.7525655644241733,
649
+ "grad_norm": 1.4183727502822876,
650
+ "learning_rate": 1.9316839677109242e-05,
651
+ "loss": 0.343,
652
+ "num_input_tokens_seen": 4723440,
653
  "step": 385
654
  },
655
  {
656
+ "epoch": 1.7753705815279361,
657
+ "grad_norm": 1.7120977640151978,
658
+ "learning_rate": 1.8718286616528697e-05,
659
+ "loss": 0.3456,
660
+ "num_input_tokens_seen": 4784880,
661
  "step": 390
662
  },
663
  {
664
+ "epoch": 1.798175598631699,
665
+ "grad_norm": 2.004385471343994,
666
+ "learning_rate": 1.812355314815898e-05,
667
+ "loss": 0.3286,
668
+ "num_input_tokens_seen": 4846064,
669
  "step": 395
670
  },
671
  {
672
+ "epoch": 1.8209806157354618,
673
+ "grad_norm": 1.51791512966156,
674
+ "learning_rate": 1.753300089934355e-05,
675
+ "loss": 0.3574,
676
+ "num_input_tokens_seen": 4908144,
677
  "step": 400
678
  },
679
  {
680
+ "epoch": 1.8209806157354618,
681
+ "eval_loss": 0.3323298394680023,
682
+ "eval_runtime": 34.2786,
683
+ "eval_samples_per_second": 90.961,
684
+ "eval_steps_per_second": 5.689,
685
+ "num_input_tokens_seen": 4908144,
686
  "step": 400
687
  },
688
  {
689
+ "epoch": 1.8437856328392246,
690
+ "grad_norm": 1.461567759513855,
691
+ "learning_rate": 1.694698895503774e-05,
692
+ "loss": 0.328,
693
+ "num_input_tokens_seen": 4969072,
694
  "step": 405
695
  },
696
  {
697
+ "epoch": 1.8665906499429874,
698
+ "grad_norm": 1.9793264865875244,
699
+ "learning_rate": 1.6365873639467315e-05,
700
+ "loss": 0.3358,
701
+ "num_input_tokens_seen": 5031152,
702
  "step": 410
703
  },
704
  {
705
+ "epoch": 1.8893956670467502,
706
+ "grad_norm": 1.3104889392852783,
707
+ "learning_rate": 1.5790008299465773e-05,
708
+ "loss": 0.3365,
709
+ "num_input_tokens_seen": 5092848,
710
  "step": 415
711
  },
712
  {
713
+ "epoch": 1.912200684150513,
714
+ "grad_norm": 1.2527827024459839,
715
+ "learning_rate": 1.5219743089621963e-05,
716
+ "loss": 0.3257,
717
+ "num_input_tokens_seen": 5154032,
718
  "step": 420
719
  },
720
  {
721
+ "epoch": 1.9350057012542758,
722
+ "grad_norm": 1.2217671871185303,
723
+ "learning_rate": 1.4655424759368852e-05,
724
+ "loss": 0.3369,
725
+ "num_input_tokens_seen": 5216112,
726
  "step": 425
727
  },
728
  {
729
+ "epoch": 1.9578107183580387,
730
+ "grad_norm": 1.307077407836914,
731
+ "learning_rate": 1.4097396442142646e-05,
732
+ "loss": 0.341,
733
+ "num_input_tokens_seen": 5278320,
734
  "step": 430
735
  },
736
  {
737
+ "epoch": 1.9806157354618015,
738
+ "grad_norm": 1.2900267839431763,
739
+ "learning_rate": 1.354599744674078e-05,
740
+ "loss": 0.324,
741
+ "num_input_tokens_seen": 5339504,
742
  "step": 435
743
  },
744
  {
745
+ "epoch": 2.0,
746
+ "grad_norm": 2.508204936981201,
747
+ "learning_rate": 1.3001563051005347e-05,
748
+ "loss": 0.32,
749
+ "num_input_tokens_seen": 5392000,
750
  "step": 440
751
  },
752
  {
753
+ "epoch": 2.022805017103763,
754
+ "grad_norm": 1.3662036657333374,
755
+ "learning_rate": 1.2464424297957613e-05,
756
+ "loss": 0.3055,
757
+ "num_input_tokens_seen": 5453568,
758
  "step": 445
759
  },
760
  {
761
+ "epoch": 2.0456100342075256,
762
+ "grad_norm": 1.703447937965393,
763
+ "learning_rate": 1.1934907794507532e-05,
764
+ "loss": 0.304,
765
+ "num_input_tokens_seen": 5514496,
766
  "step": 450
767
  },
768
  {
769
+ "epoch": 2.0684150513112884,
770
+ "grad_norm": 1.1712243556976318,
771
+ "learning_rate": 1.1413335512860535e-05,
772
+ "loss": 0.3105,
773
+ "num_input_tokens_seen": 5576192,
774
  "step": 455
775
  },
776
  {
777
+ "epoch": 2.0912200684150513,
778
+ "grad_norm": 1.150918960571289,
779
+ "learning_rate": 1.0900024594742591e-05,
780
+ "loss": 0.2971,
781
+ "num_input_tokens_seen": 5638016,
782
  "step": 460
783
  },
784
  {
785
+ "epoch": 2.114025085518814,
786
+ "grad_norm": 1.2124019861221313,
787
+ "learning_rate": 1.0395287158562294e-05,
788
+ "loss": 0.2986,
789
+ "num_input_tokens_seen": 5699456,
790
  "step": 465
791
  },
792
  {
793
+ "epoch": 2.136830102622577,
794
+ "grad_norm": 1.0262031555175781,
795
+ "learning_rate": 9.899430109627494e-06,
796
+ "loss": 0.3109,
797
+ "num_input_tokens_seen": 5761280,
798
  "step": 470
799
  },
800
  {
801
+ "epoch": 2.1596351197263397,
802
+ "grad_norm": 1.2616000175476074,
803
+ "learning_rate": 9.412754953531663e-06,
804
+ "loss": 0.3053,
805
+ "num_input_tokens_seen": 5823232,
806
  "step": 475
807
  },
808
  {
809
+ "epoch": 2.1824401368301025,
810
+ "grad_norm": 1.1844359636306763,
811
+ "learning_rate": 8.935557612823647e-06,
812
+ "loss": 0.3116,
813
+ "num_input_tokens_seen": 5885184,
814
  "step": 480
815
  },
816
  {
817
+ "epoch": 2.2052451539338653,
818
+ "grad_norm": 1.2591346502304077,
819
+ "learning_rate": 8.468128247072054e-06,
820
+ "loss": 0.2883,
821
+ "num_input_tokens_seen": 5946624,
822
  "step": 485
823
  },
824
  {
825
+ "epoch": 2.228050171037628,
826
+ "grad_norm": 1.0236306190490723,
827
+ "learning_rate": 8.010751076433975e-06,
828
+ "loss": 0.2859,
829
+ "num_input_tokens_seen": 6007936,
830
  "step": 490
831
  },
832
  {
833
+ "epoch": 2.250855188141391,
834
+ "grad_norm": 1.6473826169967651,
835
+ "learning_rate": 7.563704208835015e-06,
836
+ "loss": 0.3055,
837
+ "num_input_tokens_seen": 6069248,
838
  "step": 495
839
  },
840
  {
841
+ "epoch": 2.2736602052451538,
842
+ "grad_norm": 1.1068172454833984,
843
+ "learning_rate": 7.1272594708659574e-06,
844
+ "loss": 0.311,
845
+ "num_input_tokens_seen": 6131072,
846
  "step": 500
847
  },
848
  {
849
+ "epoch": 2.2736602052451538,
850
+ "eval_loss": 0.3113664388656616,
851
+ "eval_runtime": 34.2889,
852
+ "eval_samples_per_second": 90.933,
853
+ "eval_steps_per_second": 5.687,
854
+ "num_input_tokens_seen": 6131072,
855
  "step": 500
856
  },
857
  {
858
+ "epoch": 2.2964652223489166,
859
+ "grad_norm": 1.0342239141464233,
860
+ "learning_rate": 6.70168224249878e-06,
861
+ "loss": 0.2869,
862
+ "num_input_tokens_seen": 6192640,
863
  "step": 505
864
  },
865
  {
866
+ "epoch": 2.3192702394526794,
867
+ "grad_norm": 1.666923999786377,
868
+ "learning_rate": 6.28723129572247e-06,
869
+ "loss": 0.2944,
870
+ "num_input_tokens_seen": 6254592,
871
  "step": 510
872
  },
873
  {
874
+ "epoch": 2.342075256556442,
875
+ "grad_norm": 1.317696452140808,
876
+ "learning_rate": 5.884158637196923e-06,
877
+ "loss": 0.2967,
878
+ "num_input_tokens_seen": 6316288,
879
  "step": 515
880
  },
881
  {
882
+ "epoch": 2.364880273660205,
883
+ "grad_norm": 1.089593529701233,
884
+ "learning_rate": 5.49270935502037e-06,
885
+ "loss": 0.2747,
886
+ "num_input_tokens_seen": 6377600,
887
  "step": 520
888
  },
889
  {
890
+ "epoch": 2.387685290763968,
891
+ "grad_norm": 1.1242574453353882,
892
+ "learning_rate": 5.113121469703766e-06,
893
+ "loss": 0.2771,
894
+ "num_input_tokens_seen": 6439040,
895
  "step": 525
896
  },
897
  {
898
+ "epoch": 2.4104903078677307,
899
+ "grad_norm": 1.6130552291870117,
900
+ "learning_rate": 4.745625789442512e-06,
901
+ "loss": 0.2999,
902
+ "num_input_tokens_seen": 6501376,
903
  "step": 530
904
  },
905
  {
906
+ "epoch": 2.433295324971494,
907
+ "grad_norm": 1.1034862995147705,
908
+ "learning_rate": 4.390445769773676e-06,
909
+ "loss": 0.2982,
910
+ "num_input_tokens_seen": 6563840,
911
  "step": 535
912
  },
913
  {
914
+ "epoch": 2.4561003420752567,
915
+ "grad_norm": 1.1068782806396484,
916
+ "learning_rate": 4.047797377703985e-06,
917
+ "loss": 0.2822,
918
+ "num_input_tokens_seen": 6624896,
919
  "step": 540
920
  },
921
  {
922
+ "epoch": 2.4789053591790196,
923
+ "grad_norm": 1.4087668657302856,
924
+ "learning_rate": 3.717888960391222e-06,
925
+ "loss": 0.2867,
926
+ "num_input_tokens_seen": 6685184,
927
  "step": 545
928
  },
929
  {
930
+ "epoch": 2.5017103762827824,
931
+ "grad_norm": 1.0873833894729614,
932
+ "learning_rate": 3.40092111845883e-06,
933
+ "loss": 0.2763,
934
+ "num_input_tokens_seen": 6744960,
935
  "step": 550
936
  },
937
  {
938
+ "epoch": 2.524515393386545,
939
+ "grad_norm": 1.0662592649459839,
940
+ "learning_rate": 3.0970865840208446e-06,
941
+ "loss": 0.2975,
942
+ "num_input_tokens_seen": 6805888,
943
  "step": 555
944
  },
945
  {
946
+ "epoch": 2.547320410490308,
947
+ "grad_norm": 1.1920742988586426,
948
+ "learning_rate": 2.806570103491221e-06,
949
+ "loss": 0.2826,
950
+ "num_input_tokens_seen": 6866816,
951
  "step": 560
952
  },
953
  {
954
+ "epoch": 2.570125427594071,
955
+ "grad_norm": 1.203615427017212,
956
+ "learning_rate": 2.5295483252488955e-06,
957
+ "loss": 0.2807,
958
+ "num_input_tokens_seen": 6928128,
959
  "step": 565
960
  },
961
  {
962
+ "epoch": 2.5929304446978336,
963
+ "grad_norm": 1.0653432607650757,
964
+ "learning_rate": 2.266189692226844e-06,
965
+ "loss": 0.2882,
966
+ "num_input_tokens_seen": 6989056,
967
  "step": 570
968
  },
969
  {
970
+ "epoch": 2.6157354618015964,
971
+ "grad_norm": 1.287311315536499,
972
+ "learning_rate": 2.0166543394904424e-06,
973
+ "loss": 0.2878,
974
+ "num_input_tokens_seen": 7050496,
975
  "step": 575
976
  },
977
  {
978
+ "epoch": 2.6385404789053593,
979
+ "grad_norm": 1.430195927619934,
980
+ "learning_rate": 1.7810939968674418e-06,
981
+ "loss": 0.2913,
982
+ "num_input_tokens_seen": 7111680,
983
  "step": 580
984
  },
985
  {
986
+ "epoch": 2.661345496009122,
987
+ "grad_norm": 1.6326279640197754,
988
+ "learning_rate": 1.559651896688724e-06,
989
+ "loss": 0.2893,
990
+ "num_input_tokens_seen": 7173632,
991
  "step": 585
992
  },
993
  {
994
+ "epoch": 2.684150513112885,
995
+ "grad_norm": 1.2597157955169678,
996
+ "learning_rate": 1.3524626866959739e-06,
997
+ "loss": 0.2847,
998
+ "num_input_tokens_seen": 7234816,
999
  "step": 590
1000
  },
1001
  {
1002
+ "epoch": 2.7069555302166477,
1003
+ "grad_norm": 0.9821637272834778,
1004
+ "learning_rate": 1.1596523481691851e-06,
1005
+ "loss": 0.2902,
1006
+ "num_input_tokens_seen": 7297792,
1007
  "step": 595
1008
  },
1009
  {
1010
+ "epoch": 2.7297605473204105,
1011
+ "grad_norm": 1.521061897277832,
1012
+ "learning_rate": 9.813381193238462e-07,
1013
+ "loss": 0.2808,
1014
+ "num_input_tokens_seen": 7358336,
1015
  "step": 600
1016
  },
1017
  {
1018
+ "epoch": 2.7297605473204105,
1019
+ "eval_loss": 0.300137460231781,
1020
+ "eval_runtime": 34.3138,
1021
+ "eval_samples_per_second": 90.867,
1022
+ "eval_steps_per_second": 5.683,
1023
+ "num_input_tokens_seen": 7358336,
1024
  "step": 600
1025
  },
1026
  {
1027
+ "epoch": 2.7525655644241733,
1028
+ "grad_norm": 1.2478710412979126,
1029
+ "learning_rate": 8.176284240242638e-07,
1030
+ "loss": 0.2775,
1031
+ "num_input_tokens_seen": 7419520,
1032
  "step": 605
1033
  },
1034
  {
1035
+ "epoch": 2.775370581527936,
1036
+ "grad_norm": 1.0828979015350342,
1037
+ "learning_rate": 6.686228058565419e-07,
1038
+ "loss": 0.2874,
1039
+ "num_input_tokens_seen": 7480832,
1040
  "step": 610
1041
  },
1042
  {
1043
+ "epoch": 2.798175598631699,
1044
+ "grad_norm": 1.0666881799697876,
1045
+ "learning_rate": 5.344118676011172e-07,
1046
+ "loss": 0.2933,
1047
+ "num_input_tokens_seen": 7542400,
1048
  "step": 615
1049
  },
1050
  {
1051
+ "epoch": 2.8209806157354618,
1052
+ "grad_norm": 1.1293710470199585,
1053
+ "learning_rate": 4.1507721614183757e-07,
1054
+ "loss": 0.2807,
1055
+ "num_input_tokens_seen": 7603456,
1056
  "step": 620
1057
  },
1058
  {
1059
+ "epoch": 2.8437856328392246,
1060
+ "grad_norm": 1.029773473739624,
1061
+ "learning_rate": 3.1069141284489347e-07,
1062
+ "loss": 0.2843,
1063
+ "num_input_tokens_seen": 7664896,
1064
  "step": 625
1065
  },
1066
  {
1067
+ "epoch": 2.8665906499429874,
1068
+ "grad_norm": 1.2407749891281128,
1069
+ "learning_rate": 2.2131792943796138e-07,
1070
+ "loss": 0.2717,
1071
+ "num_input_tokens_seen": 7725824,
1072
  "step": 630
1073
  },
1074
  {
1075
+ "epoch": 2.88939566704675,
1076
+ "grad_norm": 1.1176257133483887,
1077
+ "learning_rate": 1.4701110941623963e-07,
1078
+ "loss": 0.2789,
1079
+ "num_input_tokens_seen": 7786752,
1080
  "step": 635
1081
  },
1082
  {
1083
+ "epoch": 2.912200684150513,
1084
+ "grad_norm": 1.183804988861084,
1085
+ "learning_rate": 8.781613499891373e-08,
1086
+ "loss": 0.2806,
1087
+ "num_input_tokens_seen": 7847680,
1088
  "step": 640
1089
  },
1090
  {
1091
+ "epoch": 2.935005701254276,
1092
+ "grad_norm": 1.333591341972351,
1093
+ "learning_rate": 4.376899965614079e-08,
1094
+ "loss": 0.2729,
1095
+ "num_input_tokens_seen": 7908992,
1096
  "step": 645
1097
  },
1098
  {
1099
+ "epoch": 2.9578107183580387,
1100
+ "grad_norm": 1.4014573097229004,
1101
+ "learning_rate": 1.4896486223239802e-08,
1102
+ "loss": 0.2868,
1103
+ "num_input_tokens_seen": 7970816,
1104
  "step": 650
1105
  },
1106
  {
1107
+ "epoch": 2.9806157354618015,
1108
+ "grad_norm": 1.1442041397094727,
1109
+ "learning_rate": 1.2161506153990366e-09,
1110
+ "loss": 0.2849,
1111
+ "num_input_tokens_seen": 8032384,
1112
  "step": 655
1113
  },
1114
  {
1115
+ "epoch": 2.9897377423033067,
1116
+ "num_input_tokens_seen": 8057088,
1117
+ "step": 657,
1118
+ "total_flos": 3.334823948247368e+17,
1119
+ "train_loss": 0.48195079037043603,
1120
+ "train_runtime": 3553.5571,
1121
+ "train_samples_per_second": 23.687,
1122
+ "train_steps_per_second": 0.185
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1123
  }
1124
  ],
1125
  "logging_steps": 5,
1126
+ "max_steps": 657,
1127
+ "num_input_tokens_seen": 8057088,
1128
  "num_train_epochs": 3,
1129
  "save_steps": 100,
1130
  "stateful_callbacks": {
 
1139
  "attributes": {}
1140
  }
1141
  },
1142
+ "total_flos": 3.334823948247368e+17,
1143
+ "train_batch_size": 16,
1144
  "trial_name": null,
1145
  "trial_params": null
1146
  }
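
The metrics above live in the `log_history` array of the new `trainer_state.json`: one training entry every `logging_steps` (5) and an entry carrying `eval_loss` every `eval_steps` (100). A minimal sketch, assuming the file has been downloaded locally under the same name, for pulling out the evaluation curve:

```python
# Minimal sketch (the local file path is an assumption): read trainer_state.json
# and extract the (step, eval_loss) points from "log_history".
import json

with open("trainer_state.json") as f:
    state = json.load(f)

# Only evaluation entries carry "eval_loss"; the rest are per-logging-step training entries.
eval_curve = [(entry["step"], entry["eval_loss"])
              for entry in state["log_history"]
              if "eval_loss" in entry]

for step, loss in eval_curve:
    print(f"step {step}: eval_loss {loss:.4f}")
```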
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6fc84cc3c5835f38c5b3721c54b978793ec53284573feda6cdf598a3f1b2a496
  size 5688
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:b48b6e581e832c695c8ab5978ff0e1f88a2dfc221f76a2f07d78f80a3cc7fb5f
  size 5688
training_args.yaml CHANGED
@@ -19,16 +19,17 @@ lora_target: q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj
  lr_scheduler_type: cosine
  max_grad_norm: 1.0
  max_samples: 100000
- model_name_or_path: GreatCaptainNemo/ProLLaMA_Stage_1
  num_train_epochs: 3.0
  optim: adamw_torch
- output_dir: saves/Custom/lora/train_2025-03-11-22-40-04
  packing: false
- per_device_eval_batch_size: 8
- per_device_train_batch_size: 8
  plot_loss: true
  preprocessing_num_workers: 16
  report_to: none
 
  save_steps: 100
  stage: sft
  template: alpaca
 
  lr_scheduler_type: cosine
  max_grad_norm: 1.0
  max_samples: 100000
+ model_name_or_path: GreatCaptainNemo/ProLLaMA
  num_train_epochs: 3.0
  optim: adamw_torch
+ output_dir: saves/Custom/lora/train_2025-04-05-23-57-03
  packing: false
+ per_device_eval_batch_size: 16
+ per_device_train_batch_size: 16
  plot_loss: true
  preprocessing_num_workers: 16
  report_to: none
+ resize_vocab: true
  save_steps: 100
  stage: sft
  template: alpaca
training_eval_loss.png CHANGED

Git LFS Details (old)

  • SHA256: 8955642a7c3b95415874d382d673e1f9845ef93a17283e99918f63fe5d73e502
  • Pointer size: 130 Bytes
  • Size of remote file: 38.8 kB

Git LFS Details (new)

  • SHA256: 283ac66c042f0575f5ce77a8a46121fabdac2621634d02a2b04aae38b673eb3d
  • Pointer size: 130 Bytes
  • Size of remote file: 42.3 kB
training_loss.png CHANGED

Git LFS Details (old)

  • SHA256: b3a1b5607093702b9189544da88a588cb34770a4ad7c58235e64b78073a682c1
  • Pointer size: 130 Bytes
  • Size of remote file: 29.4 kB

Git LFS Details (new)

  • SHA256: a3f184c95e3c8a4e7cf84ef821edaada399ee4be50696957592d141cd2c71685
  • Pointer size: 130 Bytes
  • Size of remote file: 30.2 kB