NBAmine committed on
Commit f4b65d4 · verified · 1 Parent(s): 2c81415

Clean up: Remove redundant training checkpoint folder

last-checkpoint/README.md DELETED
@@ -1,209 +0,0 @@
1
- ---
2
- base_model: mistralai/Mistral-Nemo-Instruct-2407
3
- library_name: peft
4
- pipeline_tag: text-generation
5
- tags:
6
- - base_model:adapter:mistralai/Mistral-Nemo-Instruct-2407
7
- - lora
8
- - sft
9
- - transformers
10
- - trl
11
- ---
12
-
13
- # Model Card for Model ID
14
-
15
- <!-- Provide a quick summary of what the model is/does. -->
16
-
17
-
18
-
19
- ## Model Details
20
-
21
- ### Model Description
22
-
23
- <!-- Provide a longer summary of what this model is. -->
24
-
25
-
26
-
27
- - **Developed by:** [More Information Needed]
28
- - **Funded by [optional]:** [More Information Needed]
29
- - **Shared by [optional]:** [More Information Needed]
30
- - **Model type:** [More Information Needed]
31
- - **Language(s) (NLP):** [More Information Needed]
32
- - **License:** [More Information Needed]
33
- - **Finetuned from model [optional]:** [More Information Needed]
34
-
35
- ### Model Sources [optional]
36
-
37
- <!-- Provide the basic links for the model. -->
38
-
39
- - **Repository:** [More Information Needed]
40
- - **Paper [optional]:** [More Information Needed]
41
- - **Demo [optional]:** [More Information Needed]
42
-
43
- ## Uses
44
-
45
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
-
47
- ### Direct Use
48
-
49
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
-
51
- [More Information Needed]
52
-
53
- ### Downstream Use [optional]
54
-
55
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
-
57
- [More Information Needed]
58
-
59
- ### Out-of-Scope Use
60
-
61
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
-
63
- [More Information Needed]
64
-
65
- ## Bias, Risks, and Limitations
66
-
67
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
-
69
- [More Information Needed]
70
-
71
- ### Recommendations
72
-
73
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
-
75
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
-
77
- ## How to Get Started with the Model
78
-
79
- Use the code below to get started with the model.
80
-
81
- [More Information Needed]
82
-
83
- ## Training Details
84
-
85
- ### Training Data
86
-
87
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
-
89
- [More Information Needed]
90
-
91
- ### Training Procedure
92
-
93
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
-
95
- #### Preprocessing [optional]
96
-
97
- [More Information Needed]
98
-
99
-
100
- #### Training Hyperparameters
101
-
102
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
-
104
- #### Speeds, Sizes, Times [optional]
105
-
106
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
-
108
- [More Information Needed]
109
-
110
- ## Evaluation
111
-
112
- <!-- This section describes the evaluation protocols and provides the results. -->
113
-
114
- ### Testing Data, Factors & Metrics
115
-
116
- #### Testing Data
117
-
118
- <!-- This should link to a Dataset Card if possible. -->
119
-
120
- [More Information Needed]
121
-
122
- #### Factors
123
-
124
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
-
126
- [More Information Needed]
127
-
128
- #### Metrics
129
-
130
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
-
132
- [More Information Needed]
133
-
134
- ### Results
135
-
136
- [More Information Needed]
137
-
138
- #### Summary
139
-
140
-
141
-
142
- ## Model Examination [optional]
143
-
144
- <!-- Relevant interpretability work for the model goes here -->
145
-
146
- [More Information Needed]
147
-
148
- ## Environmental Impact
149
-
150
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
-
152
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
-
154
- - **Hardware Type:** [More Information Needed]
155
- - **Hours used:** [More Information Needed]
156
- - **Cloud Provider:** [More Information Needed]
157
- - **Compute Region:** [More Information Needed]
158
- - **Carbon Emitted:** [More Information Needed]
159
-
160
- ## Technical Specifications [optional]
161
-
162
- ### Model Architecture and Objective
163
-
164
- [More Information Needed]
165
-
166
- ### Compute Infrastructure
167
-
168
- [More Information Needed]
169
-
170
- #### Hardware
171
-
172
- [More Information Needed]
173
-
174
- #### Software
175
-
176
- [More Information Needed]
177
-
178
- ## Citation [optional]
179
-
180
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
-
182
- **BibTeX:**
183
-
184
- [More Information Needed]
185
-
186
- **APA:**
187
-
188
- [More Information Needed]
189
-
190
- ## Glossary [optional]
191
-
192
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
-
194
- [More Information Needed]
195
-
196
- ## More Information [optional]
197
-
198
- [More Information Needed]
199
-
200
- ## Model Card Authors [optional]
201
-
202
- [More Information Needed]
203
-
204
- ## Model Card Contact
205
-
206
- [More Information Needed]
207
- ### Framework versions
208
-
209
- - PEFT 0.18.1
 
last-checkpoint/adapter_config.json DELETED
@@ -1,46 +0,0 @@
1
- {
2
- "alora_invocation_tokens": null,
3
- "alpha_pattern": {},
4
- "arrow_config": null,
5
- "auto_mapping": null,
6
- "base_model_name_or_path": "mistralai/Mistral-Nemo-Instruct-2407",
7
- "bias": "none",
8
- "corda_config": null,
9
- "ensure_weight_tying": false,
10
- "eva_config": null,
11
- "exclude_modules": null,
12
- "fan_in_fan_out": false,
13
- "inference_mode": true,
14
- "init_lora_weights": true,
15
- "layer_replication": null,
16
- "layers_pattern": null,
17
- "layers_to_transform": null,
18
- "loftq_config": {},
19
- "lora_alpha": 32,
20
- "lora_bias": false,
21
- "lora_dropout": 0.05,
22
- "megatron_config": null,
23
- "megatron_core": "megatron.core",
24
- "modules_to_save": null,
25
- "peft_type": "LORA",
26
- "peft_version": "0.18.1",
27
- "qalora_group_size": 16,
28
- "r": 16,
29
- "rank_pattern": {},
30
- "revision": null,
31
- "target_modules": [
32
- "k_proj",
33
- "down_proj",
34
- "o_proj",
35
- "gate_proj",
36
- "up_proj",
37
- "q_proj",
38
- "v_proj"
39
- ],
40
- "target_parameters": null,
41
- "task_type": "CAUSAL_LM",
42
- "trainable_token_indices": null,
43
- "use_dora": false,
44
- "use_qalora": false,
45
- "use_rslora": false
46
- }
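
Note on the deleted adapter_config.json: the values `r`, `lora_alpha`, `use_rslora` and `target_modules` determine the LoRA scaling factor and the adapter size. The sketch below works both out. The base-model dimensions (`HIDDEN`, `INTERMEDIATE`, `NUM_LAYERS`, `Q_OUT`, `KV_OUT`) do not appear in this diff; they are assumptions about mistralai/Mistral-Nemo-Instruct-2407 and should be checked against that model's own config before relying on the numbers.

```python
import math

# Values copied from the adapter_config.json shown above.
R = 16
LORA_ALPHA = 32
USE_RSLORA = False

# ASSUMPTION: base-model dimensions, not stated anywhere in this diff.
HIDDEN = 5120
INTERMEDIATE = 14336
NUM_LAYERS = 40
Q_OUT = 32 * 128   # assumed attention heads * head_dim
KV_OUT = 8 * 128   # assumed key/value heads * head_dim

# (in_features, out_features) for each entry in target_modules.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_scaling(alpha, r, use_rslora=False):
    """Factor applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def lora_param_count(shapes, r, num_layers):
    """Each adapted layer adds A (r x in) and B (out x r) parameters."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


scaling = lora_scaling(LORA_ALPHA, R, USE_RSLORA)
params = lora_param_count(SHAPES, R, NUM_LAYERS)
print(f"scaling={scaling}, params={params}, fp32 bytes={params * 4}")
```

Under these assumed dimensions the estimate is 57,016,320 adapter parameters, or 228,065,280 bytes at 4 bytes each. That is close to the 228,140,600-byte size recorded for adapter_model.safetensors in this commit, which would be consistent with float32 adapter weights plus a file header, though the diff itself does not state the dtype.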
 
last-checkpoint/adapter_model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:c14b7d8a1648c56d9f25f88d48454e081b6bc178d61bd9f0aadd582257678003
3
- size 228140600
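
Note on the three-line entries for the binary files in this commit: they are Git LFS pointer files, not the binaries themselves. Each pointer carries a spec version, a content hash, and the size in bytes. A minimal sketch of reading one follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the adapter_model.safetensors entry above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c14b7d8a1648c56d9f25f88d48454e081b6bc178d61bd9f0aadd582257678003
size 228140600
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```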
 
last-checkpoint/chat_template.jinja DELETED
@@ -1,87 +0,0 @@
1
- {%- if messages[0]["role"] == "system" %}
2
- {%- set system_message = messages[0]["content"] %}
3
- {%- set loop_messages = messages[1:] %}
4
- {%- else %}
5
- {%- set loop_messages = messages %}
6
- {%- endif %}
7
- {%- if not tools is defined %}
8
- {%- set tools = none %}
9
- {%- endif %}
10
- {%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}
11
-
12
- {#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
13
- {%- set ns = namespace() %}
14
- {%- set ns.index = 0 %}
15
- {%- for message in loop_messages %}
16
- {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
17
- {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
18
- {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
19
- {%- endif %}
20
- {%- set ns.index = ns.index + 1 %}
21
- {%- endif %}
22
- {%- endfor %}
23
-
24
- {{- bos_token }}
25
- {%- for message in loop_messages %}
26
- {%- if message["role"] == "user" %}
27
- {%- if tools is not none and (message == user_messages[-1]) %}
28
- {{- "[AVAILABLE_TOOLS][" }}
29
- {%- for tool in tools %}
30
- {%- set tool = tool.function %}
31
- {{- '{"type": "function", "function": {' }}
32
- {%- for key, val in tool.items() if key != "return" %}
33
- {%- if val is string %}
34
- {{- '"' + key + '": "' + val + '"' }}
35
- {%- else %}
36
- {{- '"' + key + '": ' + val|tojson }}
37
- {%- endif %}
38
- {%- if not loop.last %}
39
- {{- ", " }}
40
- {%- endif %}
41
- {%- endfor %}
42
- {{- "}}" }}
43
- {%- if not loop.last %}
44
- {{- ", " }}
45
- {%- else %}
46
- {{- "]" }}
47
- {%- endif %}
48
- {%- endfor %}
49
- {{- "[/AVAILABLE_TOOLS]" }}
50
- {%- endif %}
51
- {%- if loop.last and system_message is defined %}
52
- {{- "[INST]" + system_message + "\n\n" + message["content"] + "[/INST]" }}
53
- {%- else %}
54
- {{- "[INST]" + message["content"] + "[/INST]" }}
55
- {%- endif %}
56
- {%- elif (message.tool_calls is defined and message.tool_calls is not none) %}
57
- {{- "[TOOL_CALLS][" }}
58
- {%- for tool_call in message.tool_calls %}
59
- {%- set out = tool_call.function|tojson %}
60
- {{- out[:-1] }}
61
- {%- if not tool_call.id is defined or tool_call.id|length != 9 %}
62
- {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
63
- {%- endif %}
64
- {{- ', "id": "' + tool_call.id + '"}' }}
65
- {%- if not loop.last %}
66
- {{- ", " }}
67
- {%- else %}
68
- {{- "]" + eos_token }}
69
- {%- endif %}
70
- {%- endfor %}
71
- {%- elif message["role"] == "assistant" %}
72
- {{- message["content"] + eos_token}}
73
- {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
74
- {%- if message.content is defined and message.content.content is defined %}
75
- {%- set content = message.content.content %}
76
- {%- else %}
77
- {%- set content = message.content %}
78
- {%- endif %}
79
- {{- '[TOOL_RESULTS]{"content": ' + content|string + ", " }}
80
- {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
81
- {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
82
- {%- endif %}
83
- {{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
84
- {%- else %}
85
- {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
86
- {%- endif %}
87
- {%- endfor %}
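
Note on the deleted chat_template.jinja: for conversations without tools, the template reduces to a few rules. It emits the BOS token once, wraps each user turn in `[INST]...[/INST]`, appends the EOS token after each assistant turn, and prepends the system message to the user turn only when that turn is the last message. The sketch below mirrors those rules in plain Python as a reading aid. It is a simplification rather than a replacement for the template: `render_chat` is a hypothetical name, tool calls and tool results are not handled, and the two distinct template errors are collapsed into one `ValueError`.

```python
# Token strings as listed in special_tokens_map.json in this commit.
BOS, EOS = "<s>", "</s>"


def render_chat(messages):
    """Render a tool-free conversation the way the template above does."""
    system = None
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]

    parts = [BOS]
    for i, msg in enumerate(messages):
        # After the optional system message, roles must alternate
        # user/assistant/user/... starting with user.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError("conversation roles must alternate user/assistant")

        is_last = i == len(messages) - 1
        if msg["role"] == "user":
            if is_last and system is not None:
                # The system message is merged into the final user turn only.
                parts.append("[INST]" + system + "\n\n" + msg["content"] + "[/INST]")
            else:
                parts.append("[INST]" + msg["content"] + "[/INST]")
        else:
            parts.append(msg["content"] + EOS)
    return "".join(parts)


single = render_chat([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
multi = render_chat([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
])
print(repr(single))
print(repr(multi))
```

One consequence worth noting: because the system message is attached only when the final message is a user turn, a conversation that ends on an assistant turn is rendered without the system message at all.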
 
last-checkpoint/optimizer.pt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:ae997529af0fecd00cf9ea60649b8488f2fcad93e7d57149ed2055f7e443e81c
3
- size 117931203
 
last-checkpoint/rng_state.pth DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:a806988fecdee5121c06d7240dec6e61421fb0008f39bed17de1e2ca05215f14
3
- size 14645
 
last-checkpoint/scaler.pt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:dfca50dfc66d4be0e8bab60e1bfd495197005d876487c7e37b847562cfa51471
3
- size 1383
 
last-checkpoint/scheduler.pt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:7c8cc8d8f0165185e683fffb0aab5024d4cdc129dcf7f2bae80e3717e00f0c4e
3
- size 1465
 
last-checkpoint/special_tokens_map.json DELETED
@@ -1,24 +0,0 @@
1
- {
2
- "bos_token": {
3
- "content": "<s>",
4
- "lstrip": false,
5
- "normalized": false,
6
- "rstrip": false,
7
- "single_word": false
8
- },
9
- "eos_token": {
10
- "content": "</s>",
11
- "lstrip": false,
12
- "normalized": false,
13
- "rstrip": false,
14
- "single_word": false
15
- },
16
- "pad_token": "<unk>",
17
- "unk_token": {
18
- "content": "<unk>",
19
- "lstrip": false,
20
- "normalized": false,
21
- "rstrip": false,
22
- "single_word": false
23
- }
24
- }
 
last-checkpoint/tokenizer.json DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b0240ce510f08e6c2041724e9043e33be9d251d1e4a4d94eb68cd47b954b61d2
3
- size 17078292
 
last-checkpoint/tokenizer_config.json DELETED
The diff for this file is too large to render. See raw diff
 
last-checkpoint/trainer_state.json DELETED
@@ -1,2284 +0,0 @@
1
- {
2
- "best_global_step": 438,
3
- "best_metric": 1.2615772485733032,
4
- "best_model_checkpoint": "./adapter-phase2/checkpoint-438",
5
- "epoch": 5.0,
6
- "eval_steps": 500,
7
- "global_step": 2190,
8
- "is_hyper_param_search": false,
9
- "is_local_process_zero": true,
10
- "is_world_process_zero": true,
11
- "log_history": [
12
- {
13
- "entropy": 0.9881319830194115,
14
- "epoch": 0.022857142857142857,
15
- "grad_norm": 1.5965849161148071,
16
- "learning_rate": 9.958904109589041e-06,
17
- "loss": 1.6407,
18
- "mean_token_accuracy": 0.6662811586633325,
19
- "num_tokens": 15996.0,
20
- "step": 10
21
- },
22
- {
23
- "entropy": 1.2699928235262632,
24
- "epoch": 0.045714285714285714,
25
- "grad_norm": 1.3544626235961914,
26
- "learning_rate": 9.913242009132421e-06,
27
- "loss": 1.5705,
28
- "mean_token_accuracy": 0.6635363385081291,
29
- "num_tokens": 27684.0,
30
- "step": 20
31
- },
32
- {
33
- "entropy": 1.6406203664839267,
34
- "epoch": 0.06857142857142857,
35
- "grad_norm": 1.615886926651001,
36
- "learning_rate": 9.8675799086758e-06,
37
- "loss": 1.6826,
38
- "mean_token_accuracy": 0.6459833424538374,
39
- "num_tokens": 36498.0,
40
- "step": 30
41
- },
42
- {
43
- "entropy": 1.8021161697804928,
44
- "epoch": 0.09142857142857143,
45
- "grad_norm": 1.3791861534118652,
46
- "learning_rate": 9.821917808219178e-06,
47
- "loss": 1.6402,
48
- "mean_token_accuracy": 0.6510010983794927,
49
- "num_tokens": 43095.0,
50
- "step": 40
51
- },
52
- {
53
- "entropy": 1.8170176550745964,
54
- "epoch": 0.11428571428571428,
55
- "grad_norm": 1.978265643119812,
56
- "learning_rate": 9.776255707762557e-06,
57
- "loss": 1.595,
58
- "mean_token_accuracy": 0.6443639542907477,
59
- "num_tokens": 47981.0,
60
- "step": 50
61
- },
62
- {
63
- "entropy": 1.1217214532196522,
64
- "epoch": 0.13714285714285715,
65
- "grad_norm": 0.9291492700576782,
66
- "learning_rate": 9.730593607305937e-06,
67
- "loss": 1.1238,
68
- "mean_token_accuracy": 0.748357442393899,
69
- "num_tokens": 64374.0,
70
- "step": 60
71
- },
72
- {
73
- "entropy": 1.2191850792616605,
74
- "epoch": 0.16,
75
- "grad_norm": 0.946483314037323,
76
- "learning_rate": 9.684931506849316e-06,
77
- "loss": 1.1395,
78
- "mean_token_accuracy": 0.7462461993098259,
79
- "num_tokens": 76109.0,
80
- "step": 70
81
- },
82
- {
83
- "entropy": 1.3447685081511735,
84
- "epoch": 0.18285714285714286,
85
- "grad_norm": 1.2081618309020996,
86
- "learning_rate": 9.639269406392696e-06,
87
- "loss": 1.2818,
88
- "mean_token_accuracy": 0.7141567166894675,
89
- "num_tokens": 84921.0,
90
- "step": 80
91
- },
92
- {
93
- "entropy": 1.5163870930671692,
94
- "epoch": 0.2057142857142857,
95
- "grad_norm": 1.1962831020355225,
96
- "learning_rate": 9.593607305936073e-06,
97
- "loss": 1.4127,
98
- "mean_token_accuracy": 0.6898491781204938,
99
- "num_tokens": 91475.0,
100
- "step": 90
101
- },
102
- {
103
- "entropy": 1.494850355386734,
104
- "epoch": 0.22857142857142856,
105
- "grad_norm": 1.8051788806915283,
106
- "learning_rate": 9.547945205479453e-06,
107
- "loss": 1.3539,
108
- "mean_token_accuracy": 0.7036820162087679,
109
- "num_tokens": 96433.0,
110
- "step": 100
111
- },
112
- {
113
- "entropy": 0.9693804103881121,
114
- "epoch": 0.25142857142857145,
115
- "grad_norm": 0.9580877423286438,
116
- "learning_rate": 9.502283105022831e-06,
117
- "loss": 1.0186,
118
- "mean_token_accuracy": 0.7684306014329195,
119
- "num_tokens": 112394.0,
120
- "step": 110
121
- },
122
- {
123
- "entropy": 1.090338883176446,
124
- "epoch": 0.2742857142857143,
125
- "grad_norm": 0.8812002539634705,
126
- "learning_rate": 9.456621004566212e-06,
127
- "loss": 1.0132,
128
- "mean_token_accuracy": 0.7664786655455827,
129
- "num_tokens": 123947.0,
130
- "step": 120
131
- },
132
- {
133
- "entropy": 1.284313677251339,
134
- "epoch": 0.29714285714285715,
135
- "grad_norm": 1.1698031425476074,
136
- "learning_rate": 9.41095890410959e-06,
137
- "loss": 1.2496,
138
- "mean_token_accuracy": 0.7140494808554649,
139
- "num_tokens": 132309.0,
140
- "step": 130
141
- },
142
- {
143
- "entropy": 1.3905419509857893,
144
- "epoch": 0.32,
145
- "grad_norm": 1.206275463104248,
146
- "learning_rate": 9.365296803652969e-06,
147
- "loss": 1.2838,
148
- "mean_token_accuracy": 0.7115083243697882,
149
- "num_tokens": 138781.0,
150
- "step": 140
151
- },
152
- {
153
- "entropy": 1.3588767245411872,
154
- "epoch": 0.34285714285714286,
155
- "grad_norm": 1.989784598350525,
156
- "learning_rate": 9.319634703196347e-06,
157
- "loss": 1.1625,
158
- "mean_token_accuracy": 0.7235500495880842,
159
- "num_tokens": 143734.0,
160
- "step": 150
161
- },
162
- {
163
- "entropy": 0.8994554949924349,
164
- "epoch": 0.3657142857142857,
165
- "grad_norm": 1.00913405418396,
166
- "learning_rate": 9.273972602739727e-06,
167
- "loss": 0.9263,
168
- "mean_token_accuracy": 0.7811420723795891,
169
- "num_tokens": 159877.0,
170
- "step": 160
171
- },
172
- {
173
- "entropy": 1.0330400079488755,
174
- "epoch": 0.38857142857142857,
175
- "grad_norm": 1.0430705547332764,
176
- "learning_rate": 9.228310502283106e-06,
177
- "loss": 0.9874,
178
- "mean_token_accuracy": 0.7636023428291082,
179
- "num_tokens": 171343.0,
180
- "step": 170
181
- },
182
- {
183
- "entropy": 1.1962346132844686,
184
- "epoch": 0.4114285714285714,
185
- "grad_norm": 1.3926544189453125,
186
- "learning_rate": 9.182648401826484e-06,
187
- "loss": 1.1651,
188
- "mean_token_accuracy": 0.7276976224035024,
189
- "num_tokens": 179829.0,
190
- "step": 180
191
- },
192
- {
193
- "entropy": 1.2758624110370875,
194
- "epoch": 0.4342857142857143,
195
- "grad_norm": 1.4376598596572876,
196
- "learning_rate": 9.136986301369863e-06,
197
- "loss": 1.1398,
198
- "mean_token_accuracy": 0.7325437176972628,
199
- "num_tokens": 186366.0,
200
- "step": 190
201
- },
202
- {
203
- "entropy": 1.2853564880788326,
204
- "epoch": 0.45714285714285713,
205
- "grad_norm": 2.2004787921905518,
206
- "learning_rate": 9.091324200913243e-06,
207
- "loss": 1.1638,
208
- "mean_token_accuracy": 0.7276943679898977,
209
- "num_tokens": 191365.0,
210
- "step": 200
211
- },
212
- {
213
- "entropy": 0.866673743724823,
214
- "epoch": 0.48,
215
- "grad_norm": 1.0414421558380127,
216
- "learning_rate": 9.045662100456622e-06,
217
- "loss": 0.8423,
218
- "mean_token_accuracy": 0.7982745975255966,
219
- "num_tokens": 207506.0,
220
- "step": 210
221
- },
222
- {
223
- "entropy": 0.947331802919507,
224
- "epoch": 0.5028571428571429,
225
- "grad_norm": 1.0546936988830566,
226
- "learning_rate": 9e-06,
227
- "loss": 0.9144,
228
- "mean_token_accuracy": 0.7818248048424721,
229
- "num_tokens": 219079.0,
230
- "step": 220
231
- },
232
- {
233
- "entropy": 1.1076377972960472,
234
- "epoch": 0.5257142857142857,
235
- "grad_norm": 1.488133430480957,
236
- "learning_rate": 8.954337899543379e-06,
237
- "loss": 1.0699,
238
- "mean_token_accuracy": 0.7396229557693005,
239
- "num_tokens": 227798.0,
240
- "step": 230
241
- },
242
- {
243
- "entropy": 1.2146507527679204,
244
- "epoch": 0.5485714285714286,
245
- "grad_norm": 1.393114686012268,
246
- "learning_rate": 8.908675799086759e-06,
247
- "loss": 1.1319,
248
- "mean_token_accuracy": 0.7367029923945665,
249
- "num_tokens": 234590.0,
250
- "step": 240
251
- },
252
- {
253
- "entropy": 1.2185490131378174,
254
- "epoch": 0.5714285714285714,
255
- "grad_norm": 2.391268730163574,
256
- "learning_rate": 8.863013698630137e-06,
257
- "loss": 1.0764,
258
- "mean_token_accuracy": 0.7414231035858393,
259
- "num_tokens": 239719.0,
260
- "step": 250
261
- },
262
- {
263
- "entropy": 0.8054604699835182,
264
- "epoch": 0.5942857142857143,
265
- "grad_norm": 0.9834737181663513,
266
- "learning_rate": 8.817351598173518e-06,
267
- "loss": 0.8053,
268
- "mean_token_accuracy": 0.7998832739889622,
269
- "num_tokens": 256472.0,
270
- "step": 260
271
- },
272
- {
273
- "entropy": 0.9120684009045362,
274
- "epoch": 0.6171428571428571,
275
- "grad_norm": 1.0815778970718384,
276
- "learning_rate": 8.771689497716896e-06,
277
- "loss": 0.8559,
278
- "mean_token_accuracy": 0.7897930487990379,
279
- "num_tokens": 268303.0,
280
- "step": 270
281
- },
282
- {
283
- "entropy": 1.0642446961253882,
284
- "epoch": 0.64,
285
- "grad_norm": 1.4160140752792358,
286
- "learning_rate": 8.726027397260275e-06,
287
- "loss": 1.0196,
288
- "mean_token_accuracy": 0.7551121093332768,
289
- "num_tokens": 277099.0,
290
- "step": 280
291
- },
292
- {
293
- "entropy": 1.1535940799862145,
294
- "epoch": 0.6628571428571428,
295
- "grad_norm": 1.6847327947616577,
296
- "learning_rate": 8.680365296803653e-06,
297
- "loss": 1.0552,
298
- "mean_token_accuracy": 0.7411838915199042,
299
- "num_tokens": 283815.0,
300
- "step": 290
301
- },
302
- {
303
- "entropy": 1.1651097811758517,
304
- "epoch": 0.6857142857142857,
305
- "grad_norm": 2.285900592803955,
306
- "learning_rate": 8.634703196347033e-06,
307
- "loss": 1.0267,
308
- "mean_token_accuracy": 0.7420264776796103,
309
- "num_tokens": 288831.0,
310
- "step": 300
311
- },
312
- {
313
- "entropy": 0.7967551020905376,
314
- "epoch": 0.7085714285714285,
315
- "grad_norm": 1.1952354907989502,
316
- "learning_rate": 8.589041095890412e-06,
317
- "loss": 0.8013,
318
- "mean_token_accuracy": 0.803380336984992,
319
- "num_tokens": 305255.0,
320
- "step": 310
321
- },
322
- {
323
- "entropy": 0.9132619671523571,
324
- "epoch": 0.7314285714285714,
325
- "grad_norm": 1.2885327339172363,
326
- "learning_rate": 8.54337899543379e-06,
327
- "loss": 0.8516,
328
- "mean_token_accuracy": 0.7869276314973831,
329
- "num_tokens": 316828.0,
330
- "step": 320
331
- },
332
- {
333
- "entropy": 1.020271310582757,
334
- "epoch": 0.7542857142857143,
335
- "grad_norm": 1.46247398853302,
336
- "learning_rate": 8.497716894977169e-06,
337
- "loss": 0.9943,
338
- "mean_token_accuracy": 0.7584239929914475,
339
- "num_tokens": 325557.0,
340
- "step": 330
341
- },
342
- {
343
- "entropy": 1.1350981347262858,
344
- "epoch": 0.7771428571428571,
345
- "grad_norm": 1.8628100156784058,
346
- "learning_rate": 8.45205479452055e-06,
347
- "loss": 1.0646,
348
- "mean_token_accuracy": 0.7359312813729048,
349
- "num_tokens": 332224.0,
350
- "step": 340
351
- },
352
- {
353
- "entropy": 1.1172638952732086,
354
- "epoch": 0.8,
355
- "grad_norm": 2.362889289855957,
356
- "learning_rate": 8.406392694063928e-06,
357
- "loss": 0.974,
358
- "mean_token_accuracy": 0.7591060597449542,
359
- "num_tokens": 337193.0,
360
- "step": 350
361
- },
362
- {
363
- "entropy": 0.7974750218912959,
364
- "epoch": 0.8228571428571428,
365
- "grad_norm": 1.1142443418502808,
366
- "learning_rate": 8.360730593607306e-06,
367
- "loss": 0.7776,
368
- "mean_token_accuracy": 0.8041313651949167,
369
- "num_tokens": 353595.0,
370
- "step": 360
371
- },
372
- {
373
- "entropy": 0.8807439863681793,
374
- "epoch": 0.8457142857142858,
375
- "grad_norm": 1.4073342084884644,
376
- "learning_rate": 8.315068493150685e-06,
377
- "loss": 0.8367,
378
- "mean_token_accuracy": 0.7917833589017391,
379
- "num_tokens": 365258.0,
380
- "step": 370
381
- },
382
- {
383
- "entropy": 1.0210479736328124,
384
- "epoch": 0.8685714285714285,
385
- "grad_norm": 1.506361484527588,
386
- "learning_rate": 8.269406392694065e-06,
387
- "loss": 0.9998,
388
- "mean_token_accuracy": 0.753864735737443,
389
- "num_tokens": 373694.0,
390
- "step": 380
391
- },
392
- {
393
- "entropy": 1.0973432060331105,
394
- "epoch": 0.8914285714285715,
395
- "grad_norm": 1.8420379161834717,
396
- "learning_rate": 8.223744292237444e-06,
397
- "loss": 1.0189,
398
- "mean_token_accuracy": 0.7553424458950758,
399
- "num_tokens": 380272.0,
400
- "step": 390
401
- },
402
- {
403
- "entropy": 1.1227020058780908,
404
- "epoch": 0.9142857142857143,
405
- "grad_norm": 2.7073919773101807,
406
- "learning_rate": 8.178082191780822e-06,
407
- "loss": 0.939,
408
- "mean_token_accuracy": 0.755080708488822,
409
- "num_tokens": 385214.0,
410
- "step": 400
411
- },
412
- {
413
- "entropy": 0.7808640262112021,
414
- "epoch": 0.9371428571428572,
415
- "grad_norm": 1.3160523176193237,
416
- "learning_rate": 8.1324200913242e-06,
417
- "loss": 0.7749,
418
- "mean_token_accuracy": 0.8022231217473745,
419
- "num_tokens": 400292.0,
420
- "step": 410
421
- },
422
- {
423
- "entropy": 0.9076132765039802,
424
- "epoch": 0.96,
425
- "grad_norm": 1.5909351110458374,
426
- "learning_rate": 8.08675799086758e-06,
427
- "loss": 0.8487,
428
- "mean_token_accuracy": 0.78284947052598,
429
- "num_tokens": 410576.0,
430
- "step": 420
431
- },
432
- {
433
- "entropy": 1.03597099930048,
434
- "epoch": 0.9828571428571429,
435
- "grad_norm": 1.9207741022109985,
436
- "learning_rate": 8.04109589041096e-06,
437
- "loss": 0.9651,
438
- "mean_token_accuracy": 0.7578862871974706,
439
- "num_tokens": 417570.0,
440
- "step": 430
441
- },
442
- {
443
- "epoch": 1.0,
444
- "eval_accuracy": 0.008388412892696859,
445
- "eval_entropy": 0.9678071262063207,
446
- "eval_loss": 1.2615772485733032,
447
- "eval_mean_token_accuracy": 0.7336987892172971,
448
- "eval_num_tokens": 421194.0,
449
- "eval_runtime": 298.9526,
450
- "eval_samples_per_second": 3.459,
451
- "eval_steps_per_second": 0.866,
452
- "step": 438
453
- },
454
- {
455
- "entropy": 0.9707203358411789,
456
- "epoch": 1.0045714285714287,
457
- "grad_norm": 1.046062707901001,
458
- "learning_rate": 7.99543378995434e-06,
459
- "loss": 0.8608,
460
- "mean_token_accuracy": 0.7782945515293824,
461
- "num_tokens": 425275.0,
462
- "step": 440
463
- },
464
- {
465
- "entropy": 0.7523622503504157,
466
- "epoch": 1.0274285714285714,
467
- "grad_norm": 1.2251920700073242,
468
- "learning_rate": 7.949771689497718e-06,
469
- "loss": 0.7473,
470
- "mean_token_accuracy": 0.8092356324195862,
471
- "num_tokens": 439850.0,
472
- "step": 450
473
- },
474
- {
475
- "entropy": 0.84888547193259,
476
- "epoch": 1.0502857142857143,
477
- "grad_norm": 1.325343370437622,
478
- "learning_rate": 7.904109589041097e-06,
479
- "loss": 0.7907,
480
- "mean_token_accuracy": 0.7977306388318539,
481
- "num_tokens": 451136.0,
482
- "step": 460
483
- },
484
- {
485
- "entropy": 0.9735806178301573,
486
- "epoch": 1.0731428571428572,
487
- "grad_norm": 1.6439032554626465,
488
- "learning_rate": 7.858447488584475e-06,
489
- "loss": 0.9513,
490
- "mean_token_accuracy": 0.7617030199617147,
491
- "num_tokens": 459486.0,
492
- "step": 470
493
- },
494
- {
495
- "entropy": 1.0277788739651441,
496
- "epoch": 1.096,
497
- "grad_norm": 1.8182581663131714,
498
- "learning_rate": 7.812785388127855e-06,
499
- "loss": 0.9526,
500
- "mean_token_accuracy": 0.7630892738699913,
501
- "num_tokens": 465975.0,
502
- "step": 480
503
- },
504
- {
505
- "entropy": 0.943339848332107,
506
- "epoch": 1.1188571428571428,
507
- "grad_norm": 1.1697943210601807,
508
- "learning_rate": 7.767123287671234e-06,
509
- "loss": 0.851,
510
- "mean_token_accuracy": 0.7827658370137215,
511
- "num_tokens": 473929.0,
512
- "step": 490
513
- },
514
- {
515
- "entropy": 0.7733788685873151,
516
- "epoch": 1.1417142857142857,
517
- "grad_norm": 1.2411632537841797,
518
- "learning_rate": 7.721461187214612e-06,
519
- "loss": 0.7691,
520
- "mean_token_accuracy": 0.8111804500222206,
521
- "num_tokens": 488306.0,
522
- "step": 500
523
- },
524
- {
525
- "entropy": 0.824394048191607,
526
- "epoch": 1.1645714285714286,
527
- "grad_norm": 1.3971821069717407,
528
- "learning_rate": 7.675799086757991e-06,
529
- "loss": 0.7429,
530
- "mean_token_accuracy": 0.8088308341801167,
531
- "num_tokens": 499208.0,
532
- "step": 510
533
- },
534
- {
535
- "entropy": 0.9718045836314559,
536
- "epoch": 1.1874285714285715,
537
- "grad_norm": 1.7808269262313843,
538
- "learning_rate": 7.630136986301371e-06,
539
- "loss": 0.9365,
540
- "mean_token_accuracy": 0.762304800376296,
541
- "num_tokens": 507299.0,
542
- "step": 520
543
- },
544
- {
545
- "entropy": 0.9984221205115318,
546
- "epoch": 1.2102857142857144,
547
- "grad_norm": 2.0445668697357178,
548
- "learning_rate": 7.58447488584475e-06,
549
- "loss": 0.9299,
550
- "mean_token_accuracy": 0.7667763099074364,
551
- "num_tokens": 513558.0,
552
- "step": 530
553
- },
554
- {
555
- "entropy": 0.9143548993393779,
556
- "epoch": 1.233142857142857,
557
- "grad_norm": 1.4540475606918335,
558
- "learning_rate": 7.538812785388129e-06,
559
- "loss": 0.7977,
560
- "mean_token_accuracy": 0.7940818261355161,
561
- "num_tokens": 521477.0,
562
- "step": 540
563
- },
564
- {
565
- "entropy": 0.7225218357518315,
566
- "epoch": 1.256,
567
- "grad_norm": 1.4530831575393677,
568
- "learning_rate": 7.4931506849315075e-06,
569
- "loss": 0.7282,
570
- "mean_token_accuracy": 0.8143456902354955,
571
- "num_tokens": 536156.0,
572
- "step": 550
573
- },
574
- {
575
- "entropy": 0.826616644486785,
576
- "epoch": 1.278857142857143,
577
- "grad_norm": 1.307198166847229,
578
- "learning_rate": 7.447488584474887e-06,
579
- "loss": 0.7509,
580
- "mean_token_accuracy": 0.8082947298884392,
581
- "num_tokens": 547250.0,
582
- "step": 560
583
- },
584
- {
585
- "entropy": 0.9597857438027859,
586
- "epoch": 1.3017142857142856,
587
- "grad_norm": 2.1991994380950928,
588
- "learning_rate": 7.401826484018265e-06,
589
- "loss": 0.9294,
590
- "mean_token_accuracy": 0.7654231000691653,
591
- "num_tokens": 555523.0,
592
- "step": 570
593
- },
594
- {
595
- "entropy": 1.0113731533288957,
596
- "epoch": 1.3245714285714285,
597
- "grad_norm": 2.2134881019592285,
598
- "learning_rate": 7.356164383561645e-06,
599
- "loss": 0.9149,
600
- "mean_token_accuracy": 0.7716960549354553,
601
- "num_tokens": 561790.0,
602
- "step": 580
603
- },
604
- {
605
- "entropy": 0.89712286721915,
606
- "epoch": 1.3474285714285714,
607
- "grad_norm": 1.2084845304489136,
608
- "learning_rate": 7.310502283105023e-06,
609
- "loss": 0.7891,
610
- "mean_token_accuracy": 0.7911178763955832,
611
- "num_tokens": 569844.0,
612
- "step": 590
613
- },
614
- {
615
- "entropy": 0.7331022916361689,
616
- "epoch": 1.3702857142857143,
617
- "grad_norm": 1.3023542165756226,
618
- "learning_rate": 7.269406392694065e-06,
619
- "loss": 0.7457,
620
- "mean_token_accuracy": 0.8113049529492855,
621
- "num_tokens": 584459.0,
622
- "step": 600
623
- },
624
- {
625
- "entropy": 0.7879349924623966,
626
- "epoch": 1.3931428571428572,
627
- "grad_norm": 1.555379867553711,
628
- "learning_rate": 7.223744292237444e-06,
629
- "loss": 0.7306,
630
- "mean_token_accuracy": 0.8145640216767788,
631
- "num_tokens": 595804.0,
632
- "step": 610
633
- },
634
- {
635
- "entropy": 0.9201723251491785,
636
- "epoch": 1.416,
637
- "grad_norm": 2.0131261348724365,
638
- "learning_rate": 7.178082191780823e-06,
639
- "loss": 0.881,
640
- "mean_token_accuracy": 0.7761596899479628,
641
- "num_tokens": 604098.0,
642
- "step": 620
643
- },
644
- {
645
- "entropy": 1.0043058268725873,
646
- "epoch": 1.4388571428571428,
647
- "grad_norm": 1.952837586402893,
648
- "learning_rate": 7.132420091324202e-06,
649
- "loss": 0.9229,
650
- "mean_token_accuracy": 0.7723097205162048,
651
- "num_tokens": 610481.0,
652
- "step": 630
653
- },
654
- {
655
- "entropy": 0.8940085913985968,
656
- "epoch": 1.4617142857142857,
657
- "grad_norm": 1.2801399230957031,
658
- "learning_rate": 7.086757990867581e-06,
659
- "loss": 0.8006,
660
- "mean_token_accuracy": 0.7930479496717453,
661
- "num_tokens": 618699.0,
662
- "step": 640
663
- },
664
- {
665
- "entropy": 0.6966889450326562,
666
- "epoch": 1.4845714285714287,
667
- "grad_norm": 1.557562232017517,
668
- "learning_rate": 7.0410958904109596e-06,
669
- "loss": 0.665,
670
- "mean_token_accuracy": 0.8264754865318537,
671
- "num_tokens": 632856.0,
672
- "step": 650
673
- },
674
- {
675
- "entropy": 0.8100471086800098,
676
- "epoch": 1.5074285714285716,
677
- "grad_norm": 1.7616751194000244,
678
- "learning_rate": 6.995433789954339e-06,
679
- "loss": 0.7669,
680
- "mean_token_accuracy": 0.8096333492547274,
681
- "num_tokens": 643712.0,
682
- "step": 660
683
- },
684
- {
685
- "entropy": 0.9476521443575621,
686
- "epoch": 1.5302857142857142,
687
- "grad_norm": 1.97320556640625,
688
- "learning_rate": 6.9497716894977175e-06,
689
- "loss": 0.8769,
690
- "mean_token_accuracy": 0.7822451706975698,
691
- "num_tokens": 651732.0,
692
- "step": 670
693
- },
694
- {
695
- "entropy": 0.9541807420551777,
696
- "epoch": 1.5531428571428572,
697
- "grad_norm": 2.2813711166381836,
698
- "learning_rate": 6.904109589041097e-06,
699
- "loss": 0.8731,
700
- "mean_token_accuracy": 0.7764547783881426,
701
- "num_tokens": 658104.0,
702
- "step": 680
703
- },
704
- {
705
- "entropy": 0.8891686601564288,
706
- "epoch": 1.576,
707
- "grad_norm": 1.2347137928009033,
708
- "learning_rate": 6.858447488584475e-06,
709
- "loss": 0.8099,
710
- "mean_token_accuracy": 0.795854776352644,
711
- "num_tokens": 666681.0,
712
- "step": 690
713
- },
714
- {
715
- "entropy": 0.7062053712084889,
716
- "epoch": 1.5988571428571428,
717
- "grad_norm": 1.505817174911499,
718
- "learning_rate": 6.812785388127855e-06,
719
- "loss": 0.6689,
720
- "mean_token_accuracy": 0.8225430808961391,
721
- "num_tokens": 681161.0,
722
- "step": 700
723
- },
724
- {
725
- "entropy": 0.7627910353243351,
726
- "epoch": 1.6217142857142857,
727
- "grad_norm": 1.7354750633239746,
728
- "learning_rate": 6.767123287671233e-06,
729
- "loss": 0.7217,
730
- "mean_token_accuracy": 0.8088484812527895,
731
- "num_tokens": 692262.0,
732
- "step": 710
733
- },
734
- {
735
- "entropy": 0.9181301448494196,
736
- "epoch": 1.6445714285714286,
737
- "grad_norm": 1.9427331686019897,
738
- "learning_rate": 6.721461187214613e-06,
739
- "loss": 0.8664,
740
- "mean_token_accuracy": 0.7764203164726495,
741
- "num_tokens": 700252.0,
742
- "step": 720
743
- },
744
- {
745
- "entropy": 0.970825233310461,
746
- "epoch": 1.6674285714285715,
747
- "grad_norm": 2.231489419937134,
748
- "learning_rate": 6.675799086757991e-06,
749
- "loss": 0.8727,
750
- "mean_token_accuracy": 0.77991351634264,
751
- "num_tokens": 706466.0,
752
- "step": 730
753
- },
754
- {
755
- "entropy": 0.8769128978252411,
756
- "epoch": 1.6902857142857144,
757
- "grad_norm": 1.3580577373504639,
758
- "learning_rate": 6.630136986301371e-06,
759
- "loss": 0.7826,
760
- "mean_token_accuracy": 0.7997685220092535,
761
- "num_tokens": 714701.0,
762
- "step": 740
763
- },
764
- {
765
- "entropy": 0.6923451218754053,
766
- "epoch": 1.713142857142857,
767
- "grad_norm": 1.4095361232757568,
768
- "learning_rate": 6.584474885844749e-06,
769
- "loss": 0.6984,
770
- "mean_token_accuracy": 0.8204937841743231,
771
- "num_tokens": 729622.0,
772
- "step": 750
773
- },
774
- {
775
- "entropy": 0.7426450593397022,
776
- "epoch": 1.736,
777
- "grad_norm": 1.5736570358276367,
778
- "learning_rate": 6.538812785388129e-06,
779
- "loss": 0.667,
780
- "mean_token_accuracy": 0.8291565012186766,
781
- "num_tokens": 740772.0,
782
- "step": 760
783
- },
784
- {
785
- "entropy": 0.910079357214272,
786
- "epoch": 1.758857142857143,
787
- "grad_norm": 2.1047656536102295,
788
- "learning_rate": 6.493150684931508e-06,
789
- "loss": 0.875,
790
- "mean_token_accuracy": 0.7781037461012602,
791
- "num_tokens": 748857.0,
792
- "step": 770
793
- },
794
- {
795
- "entropy": 0.9749910116195679,
796
- "epoch": 1.7817142857142856,
797
- "grad_norm": 2.2609705924987793,
798
- "learning_rate": 6.447488584474887e-06,
799
- "loss": 0.9058,
800
- "mean_token_accuracy": 0.7749961122870446,
801
- "num_tokens": 755273.0,
802
- "step": 780
803
- },
804
- {
805
- "entropy": 0.8688624935224653,
806
- "epoch": 1.8045714285714287,
807
- "grad_norm": 2.156954765319824,
808
- "learning_rate": 6.401826484018266e-06,
809
- "loss": 0.7568,
810
- "mean_token_accuracy": 0.8001648161560297,
811
- "num_tokens": 763404.0,
812
- "step": 790
813
- },
814
- {
815
- "entropy": 0.6553533479571343,
816
- "epoch": 1.8274285714285714,
817
- "grad_norm": 1.5286246538162231,
818
- "learning_rate": 6.356164383561645e-06,
819
- "loss": 0.6357,
820
- "mean_token_accuracy": 0.8322514686733484,
821
- "num_tokens": 777652.0,
822
- "step": 800
823
- },
824
- {
825
- "entropy": 0.7381465582177043,
826
- "epoch": 1.8502857142857143,
827
- "grad_norm": 1.889930248260498,
828
- "learning_rate": 6.3105022831050235e-06,
829
- "loss": 0.6995,
830
- "mean_token_accuracy": 0.8194405883550644,
831
- "num_tokens": 788541.0,
832
- "step": 810
833
- },
834
- {
835
- "entropy": 0.9207667458802462,
836
- "epoch": 1.8731428571428572,
837
- "grad_norm": 2.3677663803100586,
838
- "learning_rate": 6.264840182648403e-06,
839
- "loss": 0.876,
840
- "mean_token_accuracy": 0.7714111492037773,
841
- "num_tokens": 796574.0,
842
- "step": 820
843
- },
844
- {
845
- "entropy": 0.9494761880487204,
846
- "epoch": 1.896,
847
- "grad_norm": 2.424638032913208,
848
- "learning_rate": 6.219178082191781e-06,
849
- "loss": 0.8548,
850
- "mean_token_accuracy": 0.7811690699309111,
851
- "num_tokens": 802836.0,
852
- "step": 830
853
- },
854
- {
855
- "entropy": 0.8909835416823626,
856
- "epoch": 1.9188571428571428,
857
- "grad_norm": 1.3449039459228516,
858
- "learning_rate": 6.173515981735161e-06,
859
- "loss": 0.7825,
860
- "mean_token_accuracy": 0.7954777158796787,
861
- "num_tokens": 810726.0,
862
- "step": 840
863
- },
864
- {
865
- "entropy": 0.6921561988070607,
866
- "epoch": 1.9417142857142857,
867
- "grad_norm": 1.490689992904663,
868
- "learning_rate": 6.127853881278539e-06,
869
- "loss": 0.6554,
870
- "mean_token_accuracy": 0.8262197155505419,
871
- "num_tokens": 824145.0,
872
- "step": 850
873
- },
874
- {
875
- "entropy": 0.8137379666790366,
876
- "epoch": 1.9645714285714284,
877
- "grad_norm": 2.0120434761047363,
878
- "learning_rate": 6.082191780821919e-06,
879
- "loss": 0.8024,
880
- "mean_token_accuracy": 0.7950452182441949,
881
- "num_tokens": 833220.0,
882
- "step": 860
883
- },
884
- {
885
- "entropy": 0.9502449594438076,
886
- "epoch": 1.9874285714285715,
887
- "grad_norm": 2.679570198059082,
888
- "learning_rate": 6.036529680365297e-06,
889
- "loss": 0.8545,
890
- "mean_token_accuracy": 0.781839894503355,
891
- "num_tokens": 839758.0,
892
- "step": 870
893
- },
894
- {
895
- "epoch": 2.0,
896
- "eval_accuracy": 0.00894328845369237,
897
- "eval_entropy": 0.9275054344799528,
898
- "eval_loss": 1.3659894466400146,
899
- "eval_mean_token_accuracy": 0.7276899333626147,
900
- "eval_num_tokens": 842388.0,
901
- "eval_runtime": 299.6651,
902
- "eval_samples_per_second": 3.451,
903
- "eval_steps_per_second": 0.864,
904
- "step": 876
905
- },
906
- {
907
- "entropy": 0.5898892944678664,
908
- "epoch": 2.0091428571428573,
909
- "grad_norm": 1.3116419315338135,
910
- "learning_rate": 5.990867579908676e-06,
911
- "loss": 0.5677,
912
- "mean_token_accuracy": 0.8516398537904024,
913
- "num_tokens": 7804.0,
914
- "step": 880
915
- },
916
- {
917
- "entropy": 0.6767458073794842,
918
- "epoch": 2.032,
919
- "grad_norm": 1.4093241691589355,
920
- "learning_rate": 5.945205479452055e-06,
921
- "loss": 0.6416,
922
- "mean_token_accuracy": 0.8335339192301034,
923
- "num_tokens": 21400.0,
924
- "step": 890
925
- },
926
- {
927
- "entropy": 0.7389240754768253,
928
- "epoch": 2.0548571428571427,
929
- "grad_norm": 1.8839222192764282,
930
- "learning_rate": 5.8995433789954336e-06,
931
- "loss": 0.6714,
932
- "mean_token_accuracy": 0.8232478138059378,
933
- "num_tokens": 31986.0,
934
- "step": 900
935
- },
936
- {
937
- "entropy": 0.8426970480009913,
938
- "epoch": 2.077714285714286,
939
- "grad_norm": 2.546990394592285,
940
- "learning_rate": 5.853881278538813e-06,
941
- "loss": 0.773,
942
- "mean_token_accuracy": 0.7947055101394653,
943
- "num_tokens": 39598.0,
944
- "step": 910
945
- },
946
- {
947
- "entropy": 0.9150473427027463,
948
- "epoch": 2.1005714285714285,
949
- "grad_norm": 2.7457187175750732,
950
- "learning_rate": 5.8082191780821915e-06,
951
- "loss": 0.8326,
952
- "mean_token_accuracy": 0.7829385627061128,
953
- "num_tokens": 45726.0,
954
- "step": 920
955
- },
956
- {
957
- "entropy": 0.7779074914753437,
958
- "epoch": 2.123428571428571,
959
- "grad_norm": 1.6818033456802368,
960
- "learning_rate": 5.762557077625572e-06,
961
- "loss": 0.6953,
962
- "mean_token_accuracy": 0.8222391355782748,
963
- "num_tokens": 55996.0,
964
- "step": 930
965
- },
966
- {
967
- "entropy": 0.6338619258254766,
968
- "epoch": 2.1462857142857144,
969
- "grad_norm": 1.8489398956298828,
970
- "learning_rate": 5.716894977168949e-06,
971
- "loss": 0.5875,
972
- "mean_token_accuracy": 0.8410690952092409,
973
- "num_tokens": 69720.0,
974
- "step": 940
975
- },
976
- {
977
- "entropy": 0.7125289073213935,
978
- "epoch": 2.169142857142857,
979
- "grad_norm": 1.8807828426361084,
980
- "learning_rate": 5.6712328767123296e-06,
981
- "loss": 0.6763,
982
- "mean_token_accuracy": 0.8223242565989495,
983
- "num_tokens": 80327.0,
984
- "step": 950
985
- },
986
- {
987
- "entropy": 0.8898364685475826,
988
- "epoch": 2.192,
989
- "grad_norm": 2.358139753341675,
990
- "learning_rate": 5.625570776255708e-06,
991
- "loss": 0.8491,
992
- "mean_token_accuracy": 0.7844949930906295,
993
- "num_tokens": 88242.0,
994
- "step": 960
995
- },
996
- {
997
- "entropy": 0.8897234506905078,
998
- "epoch": 2.214857142857143,
999
- "grad_norm": 2.5401251316070557,
1000
- "learning_rate": 5.5799086757990874e-06,
1001
- "loss": 0.786,
1002
- "mean_token_accuracy": 0.795602411031723,
1003
- "num_tokens": 94381.0,
1004
- "step": 970
1005
- },
1006
- {
1007
- "entropy": 0.7333379179239273,
1008
- "epoch": 2.2377142857142855,
1009
- "grad_norm": 1.7071613073349,
1010
- "learning_rate": 5.534246575342466e-06,
1011
- "loss": 0.665,
1012
- "mean_token_accuracy": 0.8248766608536243,
1013
- "num_tokens": 105014.0,
1014
- "step": 980
1015
- },
1016
- {
1017
- "entropy": 0.6419918712228536,
1018
- "epoch": 2.2605714285714287,
1019
- "grad_norm": 1.5552293062210083,
1020
- "learning_rate": 5.488584474885845e-06,
1021
- "loss": 0.6026,
1022
- "mean_token_accuracy": 0.8414491657167673,
1023
- "num_tokens": 118550.0,
1024
- "step": 990
1025
- },
1026
- {
1027
- "entropy": 0.7310339482501149,
1028
- "epoch": 2.2834285714285714,
1029
- "grad_norm": 2.278031587600708,
1030
- "learning_rate": 5.442922374429224e-06,
1031
- "loss": 0.6805,
1032
- "mean_token_accuracy": 0.8224124182015657,
1033
- "num_tokens": 128693.0,
1034
- "step": 1000
1035
- },
1036
- {
1037
- "entropy": 0.880455293878913,
1038
- "epoch": 2.306285714285714,
1039
- "grad_norm": 2.459608554840088,
1040
- "learning_rate": 5.397260273972603e-06,
1041
- "loss": 0.8336,
1042
- "mean_token_accuracy": 0.7825992915779352,
1043
- "num_tokens": 136122.0,
1044
- "step": 1010
1045
- },
1046
- {
1047
- "entropy": 0.8820295415818691,
1048
- "epoch": 2.329142857142857,
1049
- "grad_norm": 2.6355550289154053,
1050
- "learning_rate": 5.351598173515982e-06,
1051
- "loss": 0.7887,
1052
- "mean_token_accuracy": 0.7964828334748745,
1053
- "num_tokens": 142194.0,
1054
- "step": 1020
1055
- },
1056
- {
1057
- "entropy": 0.7468231266364456,
1058
- "epoch": 2.352,
1059
- "grad_norm": 1.3787378072738647,
1060
- "learning_rate": 5.305936073059361e-06,
1061
- "loss": 0.6603,
1062
- "mean_token_accuracy": 0.8238665115088224,
1063
- "num_tokens": 152715.0,
1064
- "step": 1030
1065
- },
1066
- {
1067
- "entropy": 0.6317665258422493,
1068
- "epoch": 2.374857142857143,
1069
- "grad_norm": 1.7379688024520874,
1070
- "learning_rate": 5.26027397260274e-06,
1071
- "loss": 0.6007,
1072
- "mean_token_accuracy": 0.8413413379341363,
1073
- "num_tokens": 166545.0,
1074
- "step": 1040
1075
- },
1076
- {
1077
- "entropy": 0.7211008200421929,
1078
- "epoch": 2.3977142857142857,
1079
- "grad_norm": 2.5596601963043213,
1080
- "learning_rate": 5.214611872146119e-06,
1081
- "loss": 0.656,
1082
- "mean_token_accuracy": 0.8258481413125992,
1083
- "num_tokens": 177411.0,
1084
- "step": 1050
1085
- },
1086
- {
1087
- "entropy": 0.8763142567127943,
1088
- "epoch": 2.420571428571429,
1089
- "grad_norm": 2.739737033843994,
1090
- "learning_rate": 5.1689497716894975e-06,
1091
- "loss": 0.8354,
1092
- "mean_token_accuracy": 0.7820440270006657,
1093
- "num_tokens": 185340.0,
1094
- "step": 1060
1095
- },
1096
- {
1097
- "entropy": 0.8908181961625814,
1098
- "epoch": 2.4434285714285715,
1099
- "grad_norm": 3.223233222961426,
1100
- "learning_rate": 5.123287671232877e-06,
1101
- "loss": 0.7972,
1102
- "mean_token_accuracy": 0.79459448158741,
1103
- "num_tokens": 191540.0,
1104
- "step": 1070
1105
- },
1106
- {
1107
- "entropy": 0.7510069858282804,
1108
- "epoch": 2.466285714285714,
1109
- "grad_norm": 1.703507661819458,
1110
- "learning_rate": 5.077625570776255e-06,
1111
- "loss": 0.6854,
1112
- "mean_token_accuracy": 0.8139259118586779,
1113
- "num_tokens": 201907.0,
1114
- "step": 1080
1115
- },
1116
- {
1117
- "entropy": 0.6569937597960234,
1118
- "epoch": 2.4891428571428573,
1119
- "grad_norm": 1.7559291124343872,
1120
- "learning_rate": 5.031963470319635e-06,
1121
- "loss": 0.6111,
1122
- "mean_token_accuracy": 0.84332409016788,
1123
- "num_tokens": 215231.0,
1124
- "step": 1090
1125
- },
1126
- {
1127
- "entropy": 0.7254189381375908,
1128
- "epoch": 2.512,
1129
- "grad_norm": 1.9119453430175781,
1130
- "learning_rate": 4.986301369863014e-06,
1131
- "loss": 0.6814,
1132
- "mean_token_accuracy": 0.8200360022485256,
1133
- "num_tokens": 225539.0,
1134
- "step": 1100
1135
- },
1136
- {
1137
- "entropy": 0.8703744746744633,
1138
- "epoch": 2.5348571428571427,
1139
- "grad_norm": 2.812936305999756,
1140
- "learning_rate": 4.9406392694063935e-06,
1141
- "loss": 0.8006,
1142
- "mean_token_accuracy": 0.7889351420104503,
1143
- "num_tokens": 233013.0,
1144
- "step": 1110
1145
- },
1146
- {
1147
- "entropy": 0.8793116342276335,
1148
- "epoch": 2.557714285714286,
1149
- "grad_norm": 3.227419137954712,
1150
- "learning_rate": 4.894977168949772e-06,
1151
- "loss": 0.7693,
1152
- "mean_token_accuracy": 0.7998475536704064,
1153
- "num_tokens": 238974.0,
1154
- "step": 1120
1155
- },
1156
- {
1157
- "entropy": 0.7303921280428767,
1158
- "epoch": 2.5805714285714285,
1159
- "grad_norm": 1.6300448179244995,
1160
- "learning_rate": 4.849315068493151e-06,
1161
- "loss": 0.6538,
1162
- "mean_token_accuracy": 0.8278157886117696,
1163
- "num_tokens": 249037.0,
1164
- "step": 1130
1165
- },
1166
- {
1167
- "entropy": 0.6417382193729282,
1168
- "epoch": 2.603428571428571,
1169
- "grad_norm": 1.6912339925765991,
1170
- "learning_rate": 4.80365296803653e-06,
1171
- "loss": 0.6108,
1172
- "mean_token_accuracy": 0.8381971474736929,
1173
- "num_tokens": 262395.0,
1174
- "step": 1140
1175
- },
1176
- {
1177
- "entropy": 0.7339965717867016,
1178
- "epoch": 2.6262857142857143,
1179
- "grad_norm": 2.330716371536255,
1180
- "learning_rate": 4.757990867579909e-06,
1181
- "loss": 0.6772,
1182
- "mean_token_accuracy": 0.8194692388176918,
1183
- "num_tokens": 272694.0,
1184
- "step": 1150
1185
- },
1186
- {
1187
- "entropy": 0.8351591594517231,
1188
- "epoch": 2.649142857142857,
1189
- "grad_norm": 2.8293557167053223,
1190
- "learning_rate": 4.712328767123288e-06,
1191
- "loss": 0.7642,
1192
- "mean_token_accuracy": 0.7966304961591959,
1193
- "num_tokens": 280231.0,
1194
- "step": 1160
1195
- },
1196
- {
1197
- "entropy": 0.8861342877149582,
1198
- "epoch": 2.672,
1199
- "grad_norm": 2.9575674533843994,
1200
- "learning_rate": 4.666666666666667e-06,
1201
- "loss": 0.7999,
1202
- "mean_token_accuracy": 0.7918535027652979,
1203
- "num_tokens": 286161.0,
1204
- "step": 1170
1205
- },
1206
- {
1207
- "entropy": 0.7180219950154424,
1208
- "epoch": 2.694857142857143,
1209
- "grad_norm": 1.5886666774749756,
1210
- "learning_rate": 4.6210045662100465e-06,
1211
- "loss": 0.6434,
1212
- "mean_token_accuracy": 0.8283716265112162,
1213
- "num_tokens": 296388.0,
1214
- "step": 1180
1215
- },
1216
- {
1217
- "entropy": 0.6133397184312344,
1218
- "epoch": 2.717714285714286,
1219
- "grad_norm": 1.8250149488449097,
1220
- "learning_rate": 4.575342465753425e-06,
1221
- "loss": 0.6059,
1222
- "mean_token_accuracy": 0.8467012654989958,
1223
- "num_tokens": 310217.0,
1224
- "step": 1190
1225
- },
1226
- {
1227
- "entropy": 0.6812281895428896,
1228
- "epoch": 2.7405714285714287,
1229
- "grad_norm": 2.336768627166748,
1230
- "learning_rate": 4.529680365296804e-06,
1231
- "loss": 0.6216,
1232
- "mean_token_accuracy": 0.8333451252430677,
1233
- "num_tokens": 320782.0,
1234
- "step": 1200
1235
- },
1236
- {
1237
- "entropy": 0.8352824920788408,
1238
- "epoch": 2.7634285714285713,
1239
- "grad_norm": 2.4751791954040527,
1240
- "learning_rate": 4.484018264840183e-06,
1241
- "loss": 0.7816,
1242
- "mean_token_accuracy": 0.7953334752470255,
1243
- "num_tokens": 328520.0,
1244
- "step": 1210
1245
- },
1246
- {
1247
- "entropy": 0.8553074564784765,
1248
- "epoch": 2.7862857142857145,
1249
- "grad_norm": 3.6519722938537598,
1250
- "learning_rate": 4.438356164383562e-06,
1251
- "loss": 0.7807,
1252
- "mean_token_accuracy": 0.7970743294805288,
1253
- "num_tokens": 334489.0,
1254
- "step": 1220
1255
- },
1256
- {
1257
- "entropy": 0.7267135815694928,
1258
- "epoch": 2.809142857142857,
1259
- "grad_norm": 1.6625852584838867,
1260
- "learning_rate": 4.392694063926941e-06,
1261
- "loss": 0.6521,
1262
- "mean_token_accuracy": 0.8255741696804761,
1263
- "num_tokens": 344543.0,
1264
- "step": 1230
1265
- },
1266
- {
1267
- "entropy": 0.6338536148890853,
1268
- "epoch": 2.832,
1269
- "grad_norm": 1.9026601314544678,
1270
- "learning_rate": 4.34703196347032e-06,
1271
- "loss": 0.5962,
1272
- "mean_token_accuracy": 0.8427884597331285,
1273
- "num_tokens": 357986.0,
1274
- "step": 1240
1275
- },
1276
- {
1277
- "entropy": 0.7158672915771603,
1278
- "epoch": 2.854857142857143,
1279
- "grad_norm": 2.288316488265991,
1280
- "learning_rate": 4.301369863013699e-06,
1281
- "loss": 0.6478,
1282
- "mean_token_accuracy": 0.8221574258059263,
1283
- "num_tokens": 368102.0,
1284
- "step": 1250
1285
- },
1286
- {
1287
- "entropy": 0.8342319210991264,
1288
- "epoch": 2.8777142857142857,
1289
- "grad_norm": 2.675821542739868,
1290
- "learning_rate": 4.255707762557078e-06,
1291
- "loss": 0.7634,
1292
- "mean_token_accuracy": 0.8008246626704931,
1293
- "num_tokens": 375682.0,
1294
- "step": 1260
1295
- },
1296
- {
1297
- "entropy": 0.8324337769299746,
1298
- "epoch": 2.9005714285714284,
1299
- "grad_norm": 3.794491767883301,
1300
- "learning_rate": 4.2100456621004574e-06,
1301
- "loss": 0.7409,
1302
- "mean_token_accuracy": 0.8102138575166464,
1303
- "num_tokens": 381707.0,
1304
- "step": 1270
1305
- },
1306
- {
1307
- "entropy": 0.7096458308398723,
1308
- "epoch": 2.9234285714285715,
1309
- "grad_norm": 1.945020318031311,
1310
- "learning_rate": 4.164383561643836e-06,
1311
- "loss": 0.6394,
1312
- "mean_token_accuracy": 0.8287061709910631,
1313
- "num_tokens": 391884.0,
1314
- "step": 1280
1315
- },
1316
- {
1317
- "entropy": 0.6416196620091796,
1318
- "epoch": 2.946285714285714,
1319
- "grad_norm": 2.1223883628845215,
1320
- "learning_rate": 4.118721461187215e-06,
1321
- "loss": 0.613,
1322
- "mean_token_accuracy": 0.837811603397131,
1323
- "num_tokens": 404730.0,
1324
- "step": 1290
1325
- },
1326
- {
1327
- "entropy": 0.7876641971990466,
1328
- "epoch": 2.9691428571428573,
1329
- "grad_norm": 3.030888795852661,
1330
- "learning_rate": 4.073059360730594e-06,
1331
- "loss": 0.7228,
1332
- "mean_token_accuracy": 0.8122910633683205,
1333
- "num_tokens": 413461.0,
1334
- "step": 1300
1335
- },
1336
- {
1337
- "entropy": 0.8694131746888161,
1338
- "epoch": 2.992,
1339
- "grad_norm": 2.9641623497009277,
1340
- "learning_rate": 4.027397260273973e-06,
1341
- "loss": 0.793,
1342
- "mean_token_accuracy": 0.7905130475759506,
1343
- "num_tokens": 419636.0,
1344
- "step": 1310
1345
- },
1346
- {
1347
- "epoch": 3.0,
1348
- "eval_accuracy": 0.009547123623011015,
1349
- "eval_entropy": 0.8601508936826787,
1350
- "eval_loss": 1.4408637285232544,
1351
- "eval_mean_token_accuracy": 0.7276363938931792,
1352
- "eval_num_tokens": 421194.0,
1353
- "eval_runtime": 323.9597,
1354
- "eval_samples_per_second": 3.192,
1355
- "eval_steps_per_second": 0.799,
1356
- "step": 1314
1357
- },
1358
- {
1359
- "entropy": 0.5390967978164554,
1360
- "epoch": 3.013714285714286,
1361
- "grad_norm": 1.706173300743103,
1362
- "learning_rate": 3.9863013698630135e-06,
1363
- "loss": 0.5134,
1364
- "mean_token_accuracy": 0.8650896890709797,
1365
- "num_tokens": 10877.0,
1366
- "step": 1320
1367
- },
1368
- {
1369
- "entropy": 0.5993562566116453,
1370
- "epoch": 3.0365714285714285,
1371
- "grad_norm": 1.8674507141113281,
1372
- "learning_rate": 3.940639269406393e-06,
1373
- "loss": 0.5419,
1374
- "mean_token_accuracy": 0.8520117592066526,
1375
- "num_tokens": 23811.0,
1376
- "step": 1330
1377
- },
1378
- {
1379
- "entropy": 0.6931641317903996,
1380
- "epoch": 3.0594285714285716,
1381
- "grad_norm": 3.045653820037842,
1382
- "learning_rate": 3.8949771689497714e-06,
1383
- "loss": 0.6581,
1384
- "mean_token_accuracy": 0.8223089531064034,
1385
- "num_tokens": 33518.0,
1386
- "step": 1340
1387
- },
1388
- {
1389
- "entropy": 0.8081557534635067,
1390
- "epoch": 3.0822857142857143,
1391
- "grad_norm": 3.7819550037384033,
1392
- "learning_rate": 3.849315068493151e-06,
1393
- "loss": 0.7526,
1394
- "mean_token_accuracy": 0.8033633768558502,
1395
- "num_tokens": 40731.0,
1396
- "step": 1350
1397
- },
1398
- {
1399
- "entropy": 0.8427179055288434,
1400
- "epoch": 3.105142857142857,
1401
- "grad_norm": 3.792132616043091,
1402
- "learning_rate": 3.8036529680365297e-06,
1403
- "loss": 0.7482,
1404
- "mean_token_accuracy": 0.8013022668659687,
1405
- "num_tokens": 46454.0,
1406
- "step": 1360
1407
- },
1408
- {
1409
- "entropy": 0.649565021879971,
1410
- "epoch": 3.128,
1411
- "grad_norm": 1.703001618385315,
1412
- "learning_rate": 3.7579908675799087e-06,
1413
- "loss": 0.575,
1414
- "mean_token_accuracy": 0.8482304524630309,
1415
- "num_tokens": 58387.0,
1416
- "step": 1370
1417
- },
1418
- {
1419
- "entropy": 0.6079548856243491,
1420
- "epoch": 3.150857142857143,
1421
- "grad_norm": 2.0299415588378906,
1422
- "learning_rate": 3.7123287671232876e-06,
1423
- "loss": 0.5657,
1424
- "mean_token_accuracy": 0.8562000930309296,
1425
- "num_tokens": 71361.0,
1426
- "step": 1380
1427
- },
1428
- {
1429
- "entropy": 0.6868046056479216,
1430
- "epoch": 3.1737142857142855,
1431
- "grad_norm": 2.8305366039276123,
1432
- "learning_rate": 3.6666666666666666e-06,
1433
- "loss": 0.6482,
1434
- "mean_token_accuracy": 0.8264093812555074,
1435
- "num_tokens": 81416.0,
1436
- "step": 1390
1437
- },
1438
- {
1439
- "entropy": 0.7875820865854621,
1440
- "epoch": 3.1965714285714286,
1441
- "grad_norm": 2.7740256786346436,
1442
- "learning_rate": 3.6210045662100455e-06,
1443
- "loss": 0.692,
1444
- "mean_token_accuracy": 0.8111964620649814,
1445
- "num_tokens": 88744.0,
1446
- "step": 1400
1447
- },
1448
- {
1449
- "entropy": 0.8337188992649317,
1450
- "epoch": 3.2194285714285713,
1451
- "grad_norm": 3.5716171264648438,
1452
- "learning_rate": 3.575342465753425e-06,
1453
- "loss": 0.7367,
1454
- "mean_token_accuracy": 0.8026115108281374,
1455
- "num_tokens": 94445.0,
1456
- "step": 1410
1457
- },
1458
- {
1459
- "entropy": 0.6171189024113118,
1460
- "epoch": 3.2422857142857144,
1461
- "grad_norm": 2.167130708694458,
1462
- "learning_rate": 3.529680365296804e-06,
1463
- "loss": 0.5768,
1464
- "mean_token_accuracy": 0.8457576856017113,
1465
- "num_tokens": 106447.0,
1466
- "step": 1420
1467
- },
1468
- {
1469
- "entropy": 0.588103704340756,
1470
- "epoch": 3.265142857142857,
1471
- "grad_norm": 1.812523603439331,
1472
- "learning_rate": 3.4840182648401828e-06,
1473
- "loss": 0.5576,
1474
- "mean_token_accuracy": 0.850422840192914,
1475
- "num_tokens": 119288.0,
1476
- "step": 1430
1477
- },
1478
- {
1479
- "entropy": 0.7199772633612156,
1480
- "epoch": 3.288,
1481
- "grad_norm": 2.936997890472412,
1482
- "learning_rate": 3.4383561643835617e-06,
1483
- "loss": 0.6583,
1484
- "mean_token_accuracy": 0.8228961959481239,
1485
- "num_tokens": 129120.0,
1486
- "step": 1440
1487
- },
1488
- {
1489
- "entropy": 0.8470261815935374,
1490
- "epoch": 3.310857142857143,
1491
- "grad_norm": 3.213144302368164,
1492
- "learning_rate": 3.3926940639269407e-06,
1493
- "loss": 0.7691,
1494
- "mean_token_accuracy": 0.7942754574120044,
1495
- "num_tokens": 136353.0,
1496
- "step": 1450
1497
- },
1498
- {
1499
- "entropy": 0.7986762724816799,
1500
- "epoch": 3.3337142857142856,
1501
- "grad_norm": 3.557766914367676,
1502
- "learning_rate": 3.3470319634703196e-06,
1503
- "loss": 0.6934,
1504
- "mean_token_accuracy": 0.8079280138015748,
1505
- "num_tokens": 142146.0,
1506
- "step": 1460
1507
- },
1508
- {
1509
- "entropy": 0.6144793089479208,
1510
- "epoch": 3.3565714285714288,
1511
- "grad_norm": 1.8922325372695923,
1512
- "learning_rate": 3.3013698630136985e-06,
1513
- "loss": 0.5531,
1514
- "mean_token_accuracy": 0.8498249996453524,
1515
- "num_tokens": 154573.0,
1516
- "step": 1470
1517
- },
1518
- {
1519
- "entropy": 0.6026311848312617,
1520
- "epoch": 3.3794285714285714,
1521
- "grad_norm": 2.3621621131896973,
1522
- "learning_rate": 3.2557077625570775e-06,
1523
- "loss": 0.5788,
1524
- "mean_token_accuracy": 0.843865205347538,
1525
- "num_tokens": 167191.0,
1526
- "step": 1480
1527
- },
1528
- {
1529
- "entropy": 0.692712034098804,
1530
- "epoch": 3.402285714285714,
1531
- "grad_norm": 2.5524520874023438,
1532
- "learning_rate": 3.210045662100457e-06,
1533
- "loss": 0.6331,
1534
- "mean_token_accuracy": 0.8313357140868902,
1535
- "num_tokens": 176759.0,
1536
- "step": 1490
1537
- },
1538
- {
1539
- "entropy": 0.8092848775908351,
1540
- "epoch": 3.4251428571428573,
1541
- "grad_norm": 2.9528560638427734,
1542
- "learning_rate": 3.164383561643836e-06,
1543
- "loss": 0.729,
1544
- "mean_token_accuracy": 0.8075098406523467,
1545
- "num_tokens": 183859.0,
1546
- "step": 1500
1547
- },
1548
- {
1549
- "entropy": 0.8049974404275417,
1550
- "epoch": 3.448,
1551
- "grad_norm": 4.05062198638916,
1552
- "learning_rate": 3.1187214611872147e-06,
1553
- "loss": 0.7243,
1554
- "mean_token_accuracy": 0.8051667951047421,
1555
- "num_tokens": 189583.0,
1556
- "step": 1510
1557
- },
1558
- {
1559
- "entropy": 0.6506396448239684,
1560
- "epoch": 3.4708571428571426,
1561
- "grad_norm": 1.689829707145691,
1562
- "learning_rate": 3.0730593607305937e-06,
1563
- "loss": 0.5978,
1564
- "mean_token_accuracy": 0.8373685766011476,
1565
- "num_tokens": 201924.0,
1566
- "step": 1520
1567
- },
1568
- {
1569
- "entropy": 0.5990774085745215,
1570
- "epoch": 3.4937142857142858,
1571
- "grad_norm": 1.9814187288284302,
1572
- "learning_rate": 3.0273972602739726e-06,
1573
- "loss": 0.5512,
1574
- "mean_token_accuracy": 0.8501433119177818,
1575
- "num_tokens": 215022.0,
1576
- "step": 1530
1577
- },
1578
- {
1579
- "entropy": 0.6897100256755948,
1580
- "epoch": 3.5165714285714285,
1581
- "grad_norm": 2.9063243865966797,
1582
- "learning_rate": 2.9817351598173516e-06,
1583
- "loss": 0.6379,
1584
- "mean_token_accuracy": 0.8311983995139599,
1585
- "num_tokens": 225193.0,
1586
- "step": 1540
1587
- },
1588
- {
1589
- "entropy": 0.8148755021393299,
1590
- "epoch": 3.5394285714285716,
1591
- "grad_norm": 3.0410537719726562,
1592
- "learning_rate": 2.9360730593607305e-06,
1593
- "loss": 0.7679,
1594
- "mean_token_accuracy": 0.8003406222909689,
1595
- "num_tokens": 232748.0,
1596
- "step": 1550
1597
- },
1598
- {
1599
- "entropy": 0.829321713745594,
1600
- "epoch": 3.5622857142857143,
1601
- "grad_norm": 3.606051206588745,
1602
- "learning_rate": 2.8904109589041095e-06,
1603
- "loss": 0.7419,
1604
- "mean_token_accuracy": 0.8075093895196914,
1605
- "num_tokens": 238583.0,
1606
- "step": 1560
1607
- },
1608
- {
1609
- "entropy": 0.6521147439256311,
1610
- "epoch": 3.5851428571428574,
1611
- "grad_norm": 2.431035280227661,
1612
- "learning_rate": 2.8447488584474884e-06,
1613
- "loss": 0.5798,
1614
- "mean_token_accuracy": 0.8457587130367756,
1615
- "num_tokens": 251053.0,
1616
- "step": 1570
1617
- },
1618
- {
1619
- "entropy": 0.59049033485353,
1620
- "epoch": 3.608,
1621
- "grad_norm": 2.3326406478881836,
1622
- "learning_rate": 2.7990867579908678e-06,
1623
- "loss": 0.5584,
1624
- "mean_token_accuracy": 0.8490381717681885,
1625
- "num_tokens": 263796.0,
1626
- "step": 1580
1627
- },
1628
- {
1629
- "entropy": 0.6822433460503816,
1630
- "epoch": 3.630857142857143,
1631
- "grad_norm": 2.796164035797119,
1632
- "learning_rate": 2.7534246575342467e-06,
1633
- "loss": 0.6184,
1634
- "mean_token_accuracy": 0.8322674740105868,
1635
- "num_tokens": 273597.0,
1636
- "step": 1590
1637
- },
1638
- {
1639
- "entropy": 0.802175448089838,
1640
- "epoch": 3.653714285714286,
1641
- "grad_norm": 3.028571367263794,
1642
- "learning_rate": 2.7077625570776257e-06,
1643
- "loss": 0.7123,
1644
- "mean_token_accuracy": 0.810053302720189,
1645
- "num_tokens": 280717.0,
1646
- "step": 1600
1647
- },
1648
- {
1649
- "entropy": 0.834137250110507,
1650
- "epoch": 3.6765714285714286,
1651
- "grad_norm": 3.7012383937835693,
1652
- "learning_rate": 2.6621004566210046e-06,
1653
- "loss": 0.7221,
1654
- "mean_token_accuracy": 0.8077411744743586,
1655
- "num_tokens": 286435.0,
1656
- "step": 1610
1657
- },
1658
- {
1659
- "entropy": 0.6269000029191375,
1660
- "epoch": 3.6994285714285713,
1661
- "grad_norm": 1.8523389101028442,
1662
- "learning_rate": 2.6164383561643835e-06,
1663
- "loss": 0.5711,
1664
- "mean_token_accuracy": 0.8481807161122561,
1665
- "num_tokens": 299317.0,
1666
- "step": 1620
1667
- },
1668
- {
1669
- "entropy": 0.5854253606870771,
1670
- "epoch": 3.7222857142857144,
1671
- "grad_norm": 2.166839361190796,
1672
- "learning_rate": 2.5707762557077625e-06,
1673
- "loss": 0.5558,
1674
- "mean_token_accuracy": 0.8550698190927506,
1675
- "num_tokens": 312515.0,
1676
- "step": 1630
1677
- },
1678
- {
1679
- "entropy": 0.6648679519072175,
1680
- "epoch": 3.745142857142857,
1681
- "grad_norm": 2.547304391860962,
1682
- "learning_rate": 2.5251141552511414e-06,
1683
- "loss": 0.6132,
1684
- "mean_token_accuracy": 0.8365901987999678,
1685
- "num_tokens": 322556.0,
1686
- "step": 1640
1687
- },
1688
- {
1689
- "entropy": 0.7738912200555206,
1690
- "epoch": 3.768,
1691
- "grad_norm": 3.1837921142578125,
1692
- "learning_rate": 2.479452054794521e-06,
1693
- "loss": 0.7081,
1694
- "mean_token_accuracy": 0.8078534632921219,
1695
- "num_tokens": 329999.0,
1696
- "step": 1650
1697
- },
1698
- {
1699
- "entropy": 0.8226759549230337,
1700
- "epoch": 3.790857142857143,
1701
- "grad_norm": 3.3705971240997314,
1702
- "learning_rate": 2.4337899543378997e-06,
1703
- "loss": 0.718,
1704
- "mean_token_accuracy": 0.8068133030086756,
1705
- "num_tokens": 335854.0,
1706
- "step": 1660
1707
- },
1708
- {
1709
- "entropy": 0.6193161474540829,
1710
- "epoch": 3.8137142857142856,
1711
- "grad_norm": 1.969228982925415,
1712
- "learning_rate": 2.3881278538812787e-06,
1713
- "loss": 0.5679,
1714
- "mean_token_accuracy": 0.8450891207903624,
1715
- "num_tokens": 348492.0,
1716
- "step": 1670
1717
- },
1718
- {
1719
- "entropy": 0.5715032175183297,
1720
- "epoch": 3.8365714285714283,
1721
- "grad_norm": 2.211939573287964,
1722
- "learning_rate": 2.3424657534246576e-06,
1723
- "loss": 0.5261,
1724
- "mean_token_accuracy": 0.860519690066576,
1725
- "num_tokens": 361386.0,
1726
- "step": 1680
1727
- },
1728
- {
1729
- "entropy": 0.6757410081103444,
1730
- "epoch": 3.8594285714285714,
1731
- "grad_norm": 3.2903149127960205,
1732
- "learning_rate": 2.296803652968037e-06,
1733
- "loss": 0.6431,
1734
- "mean_token_accuracy": 0.825245127826929,
1735
- "num_tokens": 371053.0,
1736
- "step": 1690
1737
- },
1738
- {
1739
- "entropy": 0.7908247765153646,
1740
- "epoch": 3.8822857142857146,
1741
- "grad_norm": 2.754093885421753,
1742
- "learning_rate": 2.251141552511416e-06,
1743
- "loss": 0.7062,
1744
- "mean_token_accuracy": 0.8119447190314532,
1745
- "num_tokens": 378222.0,
1746
- "step": 1700
1747
- },
1748
- {
1749
- "entropy": 0.8256376493722201,
1750
- "epoch": 3.9051428571428572,
1751
- "grad_norm": 3.919424295425415,
1752
- "learning_rate": 2.205479452054795e-06,
1753
- "loss": 0.7394,
1754
- "mean_token_accuracy": 0.8005702033638954,
1755
- "num_tokens": 383976.0,
1756
- "step": 1710
1757
- },
1758
- {
1759
- "entropy": 0.6268141394481063,
1760
- "epoch": 3.928,
1761
- "grad_norm": 2.0268919467926025,
1762
- "learning_rate": 2.159817351598174e-06,
1763
- "loss": 0.5669,
1764
- "mean_token_accuracy": 0.8454550232738256,
1765
- "num_tokens": 395553.0,
1766
- "step": 1720
1767
- },
1768
- {
1769
- "entropy": 0.6326769331470132,
1770
- "epoch": 3.950857142857143,
1771
- "grad_norm": 2.593083381652832,
1772
- "learning_rate": 2.1141552511415528e-06,
1773
- "loss": 0.5813,
1774
- "mean_token_accuracy": 0.8418777663260698,
1775
- "num_tokens": 407215.0,
1776
- "step": 1730
1777
- },
1778
- {
1779
- "entropy": 0.7661219704896212,
1780
- "epoch": 3.9737142857142858,
1781
- "grad_norm": 2.932962417602539,
1782
- "learning_rate": 2.0684931506849317e-06,
1783
- "loss": 0.7034,
1784
- "mean_token_accuracy": 0.8150413926690817,
1785
- "num_tokens": 414979.0,
1786
- "step": 1740
1787
- },
1788
- {
1789
- "entropy": 0.8078423272818327,
1790
- "epoch": 3.9965714285714284,
1791
- "grad_norm": 4.143108367919922,
1792
- "learning_rate": 2.0228310502283106e-06,
1793
- "loss": 0.6825,
1794
- "mean_token_accuracy": 0.8139274593442678,
1795
- "num_tokens": 420607.0,
1796
- "step": 1750
1797
- },
1798
- {
1799
- "epoch": 4.0,
1800
- "eval_accuracy": 0.00985720114239086,
1801
- "eval_entropy": 0.833712039652018,
1802
- "eval_loss": 1.498316764831543,
1803
- "eval_mean_token_accuracy": 0.72518339603564,
1804
- "eval_num_tokens": 421194.0,
1805
- "eval_runtime": 325.9415,
1806
- "eval_samples_per_second": 3.172,
1807
- "eval_steps_per_second": 0.795,
1808
- "step": 1752
1809
- },
1810
- {
1811
- "entropy": 0.5481636302643701,
1812
- "epoch": 4.018285714285715,
1813
- "grad_norm": 1.847604513168335,
1814
- "learning_rate": 1.9771689497716896e-06,
1815
- "loss": 0.5007,
1816
- "mean_token_accuracy": 0.8671465257280752,
1817
- "num_tokens": 434791.0,
1818
- "step": 1760
1819
- },
1820
- {
1821
- "entropy": 0.5742504514753819,
1822
- "epoch": 4.041142857142857,
1823
- "grad_norm": 2.2244632244110107,
1824
- "learning_rate": 1.931506849315069e-06,
1825
- "loss": 0.5271,
1826
- "mean_token_accuracy": 0.8591625761240721,
1827
- "num_tokens": 447092.0,
1828
- "step": 1770
1829
- },
1830
- {
1831
- "entropy": 0.6953230138868094,
1832
- "epoch": 4.064,
1833
- "grad_norm": 3.42199444770813,
1834
- "learning_rate": 1.8858447488584477e-06,
1835
- "loss": 0.6469,
1836
- "mean_token_accuracy": 0.8285580322146415,
1837
- "num_tokens": 456254.0,
1838
- "step": 1780
1839
- },
1840
- {
1841
- "entropy": 0.7736836820840836,
1842
- "epoch": 4.086857142857143,
1843
- "grad_norm": 3.351454257965088,
1844
- "learning_rate": 1.8401826484018268e-06,
1845
- "loss": 0.6909,
1846
- "mean_token_accuracy": 0.8115375626832246,
1847
- "num_tokens": 463064.0,
1848
- "step": 1790
1849
- },
1850
- {
1851
- "entropy": 0.771463468298316,
1852
- "epoch": 4.109714285714285,
1853
- "grad_norm": 4.134479522705078,
1854
- "learning_rate": 1.7945205479452058e-06,
1855
- "loss": 0.6807,
1856
- "mean_token_accuracy": 0.8126782298088073,
1857
- "num_tokens": 468431.0,
1858
- "step": 1800
1859
- },
1860
- {
1861
- "entropy": 0.5649335160851479,
1862
- "epoch": 4.132571428571429,
1863
- "grad_norm": 2.1762540340423584,
1864
- "learning_rate": 1.7488584474885847e-06,
1865
- "loss": 0.5221,
1866
- "mean_token_accuracy": 0.8567749988287687,
1867
- "num_tokens": 482534.0,
1868
- "step": 1810
1869
- },
1870
- {
1871
- "entropy": 0.5837410872802138,
1872
- "epoch": 4.155428571428572,
1873
- "grad_norm": 2.349236011505127,
1874
- "learning_rate": 1.7031963470319637e-06,
1875
- "loss": 0.5371,
1876
- "mean_token_accuracy": 0.8581233065575361,
1877
- "num_tokens": 494845.0,
1878
- "step": 1820
1879
- },
1880
- {
1881
- "entropy": 0.6922593496739864,
1882
- "epoch": 4.178285714285714,
1883
- "grad_norm": 2.9896738529205322,
1884
- "learning_rate": 1.6575342465753428e-06,
1885
- "loss": 0.6648,
1886
- "mean_token_accuracy": 0.8241277992725372,
1887
- "num_tokens": 504153.0,
1888
- "step": 1830
1889
- },
1890
- {
1891
- "entropy": 0.7702124075964093,
1892
- "epoch": 4.201142857142857,
1893
- "grad_norm": 3.322385549545288,
1894
- "learning_rate": 1.6118721461187218e-06,
1895
- "loss": 0.6712,
1896
- "mean_token_accuracy": 0.8188040722161531,
1897
- "num_tokens": 511101.0,
1898
- "step": 1840
1899
- },
1900
- {
1901
- "entropy": 0.8012370727956295,
1902
- "epoch": 4.224,
1903
- "grad_norm": 4.359086036682129,
1904
- "learning_rate": 1.5662100456621007e-06,
1905
- "loss": 0.6748,
1906
- "mean_token_accuracy": 0.8118188168853522,
1907
- "num_tokens": 516460.0,
1908
- "step": 1850
1909
- },
1910
- {
1911
- "entropy": 0.5523576781153678,
1912
- "epoch": 4.246857142857142,
1913
- "grad_norm": 2.107539176940918,
1914
- "learning_rate": 1.5205479452054797e-06,
1915
- "loss": 0.5081,
1916
- "mean_token_accuracy": 0.8673421230167151,
1917
- "num_tokens": 530848.0,
1918
- "step": 1860
1919
- },
1920
- {
1921
- "entropy": 0.5729602897539735,
1922
- "epoch": 4.269714285714286,
1923
- "grad_norm": 2.5580902099609375,
1924
- "learning_rate": 1.4748858447488584e-06,
1925
- "loss": 0.5319,
1926
- "mean_token_accuracy": 0.8585442833602428,
1927
- "num_tokens": 543148.0,
1928
- "step": 1870
1929
- },
1930
- {
1931
- "entropy": 0.6956869766116143,
1932
- "epoch": 4.292571428571429,
1933
- "grad_norm": 3.1137397289276123,
1934
- "learning_rate": 1.4292237442922373e-06,
1935
- "loss": 0.6509,
1936
- "mean_token_accuracy": 0.8259295519441366,
1937
- "num_tokens": 552615.0,
1938
- "step": 1880
1939
- },
1940
- {
1941
- "entropy": 0.7814504994079471,
1942
- "epoch": 4.315428571428572,
1943
- "grad_norm": 3.9837899208068848,
1944
- "learning_rate": 1.3835616438356165e-06,
1945
- "loss": 0.6732,
1946
- "mean_token_accuracy": 0.8206901982426643,
1947
- "num_tokens": 559644.0,
1948
- "step": 1890
1949
- },
1950
- {
1951
- "entropy": 0.8006520505994559,
1952
- "epoch": 4.338285714285714,
1953
- "grad_norm": 4.293622016906738,
1954
- "learning_rate": 1.3378995433789954e-06,
1955
- "loss": 0.687,
1956
- "mean_token_accuracy": 0.8162542518228293,
1957
- "num_tokens": 565049.0,
1958
- "step": 1900
1959
- },
1960
- {
1961
- "entropy": 0.5543251828290522,
1962
- "epoch": 4.361142857142857,
1963
- "grad_norm": 2.0004706382751465,
1964
- "learning_rate": 1.2922374429223744e-06,
1965
- "loss": 0.5105,
1966
- "mean_token_accuracy": 0.8651686757802963,
1967
- "num_tokens": 579511.0,
1968
- "step": 1910
1969
- },
1970
- {
1971
- "entropy": 0.5778749627992511,
1972
- "epoch": 4.384,
1973
- "grad_norm": 2.4198007583618164,
1974
- "learning_rate": 1.2465753424657535e-06,
1975
- "loss": 0.5216,
1976
- "mean_token_accuracy": 0.8559423860162496,
1977
- "num_tokens": 591873.0,
1978
- "step": 1920
1979
- },
1980
- {
1981
- "entropy": 0.6606394873932004,
1982
- "epoch": 4.406857142857143,
1983
- "grad_norm": 3.485213279724121,
1984
- "learning_rate": 1.2009132420091325e-06,
1985
- "loss": 0.6086,
1986
- "mean_token_accuracy": 0.8372324761003256,
1987
- "num_tokens": 601337.0,
1988
- "step": 1930
1989
- },
1990
- {
1991
- "entropy": 0.7576876068487763,
1992
- "epoch": 4.429714285714286,
1993
- "grad_norm": 3.3860034942626953,
1994
- "learning_rate": 1.1552511415525116e-06,
1995
- "loss": 0.7,
1996
- "mean_token_accuracy": 0.8150843985378742,
1997
- "num_tokens": 608326.0,
1998
- "step": 1940
1999
- },
2000
- {
2001
- "entropy": 0.7779752794653177,
2002
- "epoch": 4.452571428571429,
2003
- "grad_norm": 4.422528266906738,
2004
- "learning_rate": 1.1095890410958906e-06,
2005
- "loss": 0.6755,
2006
- "mean_token_accuracy": 0.8102645222097635,
2007
- "num_tokens": 613860.0,
2008
- "step": 1950
2009
- },
2010
- {
2011
- "entropy": 0.5644787142053247,
2012
- "epoch": 4.475428571428571,
2013
- "grad_norm": 2.2278542518615723,
2014
- "learning_rate": 1.0639269406392695e-06,
2015
- "loss": 0.5188,
2016
- "mean_token_accuracy": 0.8587661664932966,
2017
- "num_tokens": 627853.0,
2018
- "step": 1960
2019
- },
2020
- {
2021
- "entropy": 0.5740617036819458,
2022
- "epoch": 4.498285714285714,
2023
- "grad_norm": 2.4340105056762695,
2024
- "learning_rate": 1.0182648401826485e-06,
2025
- "loss": 0.5167,
2026
- "mean_token_accuracy": 0.8595852922648192,
2027
- "num_tokens": 640020.0,
2028
- "step": 1970
2029
- },
2030
- {
2031
- "entropy": 0.671654068864882,
2032
- "epoch": 4.521142857142857,
2033
- "grad_norm": 3.127539873123169,
2034
- "learning_rate": 9.726027397260274e-07,
2035
- "loss": 0.6331,
2036
- "mean_token_accuracy": 0.8256836850196123,
2037
- "num_tokens": 649058.0,
2038
- "step": 1980
2039
- },
2040
- {
2041
- "entropy": 0.7583655359223485,
2042
- "epoch": 4.5440000000000005,
2043
- "grad_norm": 3.5964298248291016,
2044
- "learning_rate": 9.269406392694065e-07,
2045
- "loss": 0.679,
2046
- "mean_token_accuracy": 0.8139733098447323,
2047
- "num_tokens": 655831.0,
2048
- "step": 1990
2049
- },
2050
- {
2051
- "entropy": 0.7816360153257846,
2052
- "epoch": 4.566857142857143,
2053
- "grad_norm": 4.389492511749268,
2054
- "learning_rate": 8.812785388127855e-07,
2055
- "loss": 0.6784,
2056
- "mean_token_accuracy": 0.8120625615119934,
2057
- "num_tokens": 661256.0,
2058
- "step": 2000
2059
- },
2060
- {
2061
- "entropy": 0.5732687024399639,
2062
- "epoch": 4.589714285714286,
2063
- "grad_norm": 2.0767221450805664,
2064
- "learning_rate": 8.356164383561644e-07,
2065
- "loss": 0.5335,
2066
- "mean_token_accuracy": 0.8624513667076826,
2067
- "num_tokens": 675612.0,
2068
- "step": 2010
2069
- },
2070
- {
2071
- "entropy": 0.5804870082065463,
2072
- "epoch": 4.612571428571428,
2073
- "grad_norm": 2.554534673690796,
2074
- "learning_rate": 7.899543378995435e-07,
2075
- "loss": 0.5238,
2076
- "mean_token_accuracy": 0.8590863507241011,
2077
- "num_tokens": 687684.0,
2078
- "step": 2020
2079
- },
2080
- {
2081
- "entropy": 0.6967678766697645,
2082
- "epoch": 4.635428571428571,
2083
- "grad_norm": 3.255140542984009,
2084
- "learning_rate": 7.442922374429224e-07,
2085
- "loss": 0.6487,
2086
- "mean_token_accuracy": 0.8244634248316288,
2087
- "num_tokens": 696675.0,
2088
- "step": 2030
2089
- },
2090
- {
2091
- "entropy": 0.7526340587064624,
2092
- "epoch": 4.658285714285714,
2093
- "grad_norm": 3.69323992729187,
2094
- "learning_rate": 6.986301369863015e-07,
2095
- "loss": 0.6719,
2096
- "mean_token_accuracy": 0.8216490592807532,
2097
- "num_tokens": 703456.0,
2098
- "step": 2040
2099
- },
2100
- {
2101
- "entropy": 0.7950498787686229,
2102
- "epoch": 4.6811428571428575,
2103
- "grad_norm": 4.715794563293457,
2104
- "learning_rate": 6.529680365296804e-07,
2105
- "loss": 0.6808,
2106
- "mean_token_accuracy": 0.8184644509106874,
2107
- "num_tokens": 708782.0,
2108
- "step": 2050
2109
- },
2110
- {
2111
- "entropy": 0.5505135927349329,
2112
- "epoch": 4.704,
2113
- "grad_norm": 2.3146073818206787,
2114
- "learning_rate": 6.073059360730594e-07,
2115
- "loss": 0.507,
2116
- "mean_token_accuracy": 0.8652824487537145,
2117
- "num_tokens": 723247.0,
2118
- "step": 2060
2119
- },
2120
- {
2121
- "entropy": 0.5807493371888995,
2122
- "epoch": 4.726857142857143,
2123
- "grad_norm": 2.615732192993164,
2124
- "learning_rate": 5.616438356164384e-07,
2125
- "loss": 0.5342,
2126
- "mean_token_accuracy": 0.854337964951992,
2127
- "num_tokens": 735283.0,
2128
- "step": 2070
2129
- },
2130
- {
2131
- "entropy": 0.7081154704093933,
2132
- "epoch": 4.749714285714286,
2133
- "grad_norm": 3.0795960426330566,
2134
- "learning_rate": 5.159817351598174e-07,
2135
- "loss": 0.6499,
2136
- "mean_token_accuracy": 0.8241405732929706,
2137
- "num_tokens": 744298.0,
2138
- "step": 2080
2139
- },
2140
- {
2141
- "entropy": 0.783203998953104,
2142
- "epoch": 4.772571428571428,
2143
- "grad_norm": 3.7807230949401855,
2144
- "learning_rate": 4.7031963470319636e-07,
2145
- "loss": 0.6948,
2146
- "mean_token_accuracy": 0.8167315106838942,
2147
- "num_tokens": 751212.0,
2148
- "step": 2090
2149
- },
2150
- {
2151
- "entropy": 0.7742167858406901,
2152
- "epoch": 4.795428571428571,
2153
- "grad_norm": 4.185308933258057,
2154
- "learning_rate": 4.2465753424657536e-07,
2155
- "loss": 0.6705,
2156
- "mean_token_accuracy": 0.8145616598427295,
2157
- "num_tokens": 756648.0,
2158
- "step": 2100
2159
- },
2160
- {
2161
- "entropy": 0.5554421614855528,
2162
- "epoch": 4.8182857142857145,
2163
- "grad_norm": 2.0456132888793945,
2164
- "learning_rate": 3.7899543378995436e-07,
2165
- "loss": 0.4982,
2166
- "mean_token_accuracy": 0.8656269229948521,
2167
- "num_tokens": 771047.0,
2168
- "step": 2110
2169
- },
2170
- {
2171
- "entropy": 0.5584687992930413,
2172
- "epoch": 4.841142857142858,
2173
- "grad_norm": 2.591322422027588,
2174
- "learning_rate": 3.3333333333333335e-07,
2175
- "loss": 0.5038,
2176
- "mean_token_accuracy": 0.8639244794845581,
2177
- "num_tokens": 783484.0,
2178
- "step": 2120
2179
- },
2180
- {
2181
- "entropy": 0.6443877406418324,
2182
- "epoch": 4.864,
2183
- "grad_norm": 3.1148664951324463,
2184
- "learning_rate": 2.8767123287671235e-07,
2185
- "loss": 0.5898,
2186
- "mean_token_accuracy": 0.8393101956695318,
2187
- "num_tokens": 793159.0,
2188
- "step": 2130
2189
- },
2190
- {
2191
- "entropy": 0.7694938328117132,
2192
- "epoch": 4.886857142857143,
2193
- "grad_norm": 3.860647201538086,
2194
- "learning_rate": 2.4200913242009135e-07,
2195
- "loss": 0.6661,
2196
- "mean_token_accuracy": 0.8193999473005533,
2197
- "num_tokens": 800322.0,
2198
- "step": 2140
2199
- },
2200
- {
2201
- "entropy": 0.7594648722559214,
2202
- "epoch": 4.909714285714285,
2203
- "grad_norm": 4.108844757080078,
2204
- "learning_rate": 1.9634703196347034e-07,
2205
- "loss": 0.6513,
2206
- "mean_token_accuracy": 0.820786502957344,
2207
- "num_tokens": 805782.0,
2208
- "step": 2150
2209
- },
2210
- {
2211
- "entropy": 0.5718974178656936,
2212
- "epoch": 4.932571428571428,
2213
- "grad_norm": 2.0961592197418213,
2214
- "learning_rate": 1.5068493150684934e-07,
2215
- "loss": 0.5321,
2216
- "mean_token_accuracy": 0.8587038304656744,
2217
- "num_tokens": 819399.0,
2218
- "step": 2160
2219
- },
2220
- {
2221
- "entropy": 0.6284356378018856,
2222
- "epoch": 4.9554285714285715,
2223
- "grad_norm": 2.863541841506958,
2224
- "learning_rate": 1.0502283105022832e-07,
2225
- "loss": 0.5916,
2226
- "mean_token_accuracy": 0.8409151379019022,
2227
- "num_tokens": 830244.0,
2228
- "step": 2170
2229
- },
2230
- {
2231
- "entropy": 0.7672561943531037,
2232
- "epoch": 4.978285714285715,
2233
- "grad_norm": 3.52964186668396,
2234
- "learning_rate": 5.936073059360731e-08,
2235
- "loss": 0.6968,
2236
- "mean_token_accuracy": 0.8189954232424498,
2237
- "num_tokens": 837492.0,
2238
- "step": 2180
2239
- },
2240
- {
2241
- "entropy": 0.7868584784630098,
2242
- "epoch": 5.0,
2243
- "grad_norm": 6.119595050811768,
2244
- "learning_rate": 1.3698630136986303e-08,
2245
- "loss": 0.6771,
2246
- "mean_token_accuracy": 0.8141911743502868,
2247
- "num_tokens": 842388.0,
2248
- "step": 2190
2249
- },
2250
- {
2251
- "epoch": 5.0,
2252
- "eval_accuracy": 0.009840881272949817,
2253
- "eval_entropy": 0.8240771901193272,
2254
- "eval_loss": 1.5346653461456299,
2255
- "eval_mean_token_accuracy": 0.7217774188656605,
2256
- "eval_num_tokens": 842388.0,
2257
- "eval_runtime": 325.5092,
2258
- "eval_samples_per_second": 3.177,
2259
- "eval_steps_per_second": 0.796,
2260
- "step": 2190
2261
- }
2262
- ],
2263
- "logging_steps": 10,
2264
- "max_steps": 2190,
2265
- "num_input_tokens_seen": 0,
2266
- "num_train_epochs": 5,
2267
- "save_steps": 500,
2268
- "stateful_callbacks": {
2269
- "TrainerControl": {
2270
- "args": {
2271
- "should_epoch_stop": false,
2272
- "should_evaluate": false,
2273
- "should_log": false,
2274
- "should_save": true,
2275
- "should_training_stop": true
2276
- },
2277
- "attributes": {}
2278
- }
2279
- },
2280
- "total_flos": 1.470014665030656e+17,
2281
- "train_batch_size": 1,
2282
- "trial_name": null,
2283
- "trial_params": null
2284
- }

last-checkpoint/training_args.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4f69888a63d7dbdc14452efa835b06908606f8c384200e78ac65b432c94b2f31
3
- size 6353