jrosseruk committed
Commit 0f56d94 · verified · 1 Parent(s): 173034f

Upload folder using huggingface_hub

split-1/README.md ADDED
@@ -0,0 +1,209 @@
1
+ ---
2
+ base_model: allenai/OLMo-3-1025-7B
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:allenai/OLMo-3-1025-7B
7
+ - lora
8
+ - sft
9
+ - transformers
10
+ - trl
11
+ ---
12
+
13
+ # Model Card for Model ID
14
+
15
+ <!-- Provide a quick summary of what the model is/does. -->
16
+
17
+
18
+
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+
23
+ <!-- Provide a longer summary of what this model is. -->
24
+
25
+
26
+
27
+ - **Developed by:** [More Information Needed]
28
+ - **Funded by [optional]:** [More Information Needed]
29
+ - **Shared by [optional]:** [More Information Needed]
30
+ - **Model type:** [More Information Needed]
31
+ - **Language(s) (NLP):** [More Information Needed]
32
+ - **License:** [More Information Needed]
33
+ - **Finetuned from model [optional]:** [More Information Needed]
34
+
35
+ ### Model Sources [optional]
36
+
37
+ <!-- Provide the basic links for the model. -->
38
+
39
+ - **Repository:** [More Information Needed]
40
+ - **Paper [optional]:** [More Information Needed]
41
+ - **Demo [optional]:** [More Information Needed]
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
+
47
+ ### Direct Use
48
+
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
+
51
+ [More Information Needed]
52
+
53
+ ### Downstream Use [optional]
54
+
55
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
+
57
+ [More Information Needed]
58
+
59
+ ### Out-of-Scope Use
60
+
61
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
+
63
+ [More Information Needed]
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
+
69
+ [More Information Needed]
70
+
71
+ ### Recommendations
72
+
73
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
+
75
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
+
77
+ ## How to Get Started with the Model
78
+
79
+ Use the code below to get started with the model.
80
+
81
+ [More Information Needed]
82
+
83
+ ## Training Details
84
+
85
+ ### Training Data
86
+
87
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
+
89
+ [More Information Needed]
90
+
91
+ ### Training Procedure
92
+
93
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
+
95
+ #### Preprocessing [optional]
96
+
97
+ [More Information Needed]
98
+
99
+
100
+ #### Training Hyperparameters
101
+
102
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
+
104
+ #### Speeds, Sizes, Times [optional]
105
+
106
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
+
108
+ [More Information Needed]
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ [More Information Needed]
121
+
122
+ #### Factors
123
+
124
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
+
126
+ [More Information Needed]
127
+
128
+ #### Metrics
129
+
130
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
+
132
+ [More Information Needed]
133
+
134
+ ### Results
135
+
136
+ [More Information Needed]
137
+
138
+ #### Summary
139
+
140
+
141
+
142
+ ## Model Examination [optional]
143
+
144
+ <!-- Relevant interpretability work for the model goes here -->
145
+
146
+ [More Information Needed]
147
+
148
+ ## Environmental Impact
149
+
150
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
+
152
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
+
154
+ - **Hardware Type:** [More Information Needed]
155
+ - **Hours used:** [More Information Needed]
156
+ - **Cloud Provider:** [More Information Needed]
157
+ - **Compute Region:** [More Information Needed]
158
+ - **Carbon Emitted:** [More Information Needed]
159
+
160
+ ## Technical Specifications [optional]
161
+
162
+ ### Model Architecture and Objective
163
+
164
+ [More Information Needed]
165
+
166
+ ### Compute Infrastructure
167
+
168
+ [More Information Needed]
169
+
170
+ #### Hardware
171
+
172
+ [More Information Needed]
173
+
174
+ #### Software
175
+
176
+ [More Information Needed]
177
+
178
+ ## Citation [optional]
179
+
180
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
+
182
+ **BibTeX:**
183
+
184
+ [More Information Needed]
185
+
186
+ **APA:**
187
+
188
+ [More Information Needed]
189
+
190
+ ## Glossary [optional]
191
+
192
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
+
194
+ [More Information Needed]
195
+
196
+ ## More Information [optional]
197
+
198
+ [More Information Needed]
199
+
200
+ ## Model Card Authors [optional]
201
+
202
+ [More Information Needed]
203
+
204
+ ## Model Card Contact
205
+
206
+ [More Information Needed]
207
+ ### Framework versions
208
+
209
+ - PEFT 0.18.1
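
The README's "How to Get Started with the Model" section is still a placeholder. A minimal sketch of how an adapter like this one might be loaded is given below. It assumes a local copy of the `split-1` folder that also contains the tokenizer files from this commit, and that `transformers` and `peft` are installed; the function names `load_adapter` and `generate_reply` are hypothetical and not part of the commit.

```python
def load_adapter(adapter_dir: str = "split-1"):
    """Load the base model and attach the LoRA adapter stored in adapter_dir.

    adapter_dir is assumed to be a local folder holding adapter_config.json,
    adapter_model.safetensors and the tokenizer files from this commit.
    """
    # Imported lazily so the sketch can be read without the libraries installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Base model name taken from base_model_name_or_path in adapter_config.json.
    base = AutoModelForCausalLM.from_pretrained("allenai/OLMo-3-1025-7B")
    model = PeftModel.from_pretrained(base, adapter_dir)
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer


def generate_reply(model, tokenizer, user_text: str, max_new_tokens: int = 64) -> str:
    """Format one user turn with the bundled chat template and generate a reply."""
    messages = [{"role": "user", "content": user_text}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # The template already emits every control token, so none are added here.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens produced after the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `load_adapter()` downloads the 7B base model, so this is best tried on a machine with enough memory; dtype and device placement are left at library defaults here.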
split-1/adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "allenai/OLMo-3-1025-7B",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.1,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "up_proj",
+     "q_proj",
+     "v_proj",
+     "gate_proj",
+     "k_proj",
+     "down_proj",
+     "o_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
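
The config above sets `lora_alpha` to 64 and `r` to 32 with `use_rslora` disabled. As a small worked example, the sketch below computes the scaling factor that LoRA applies to the low-rank update under those settings. The dictionary is a hand-copied subset of the config, and `lora_scaling` is a hypothetical helper, not a PEFT function.

```python
import math

# Hand-copied subset of the adapter_config.json values shown in the diff.
adapter_config = {
    "lora_alpha": 64,
    "r": 32,
    "use_rslora": False,
}


def lora_scaling(config: dict) -> float:
    """Return the factor multiplied onto the low-rank update B @ A."""
    alpha, rank = config["lora_alpha"], config["r"]
    if config.get("use_rslora", False):
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return alpha / math.sqrt(rank)
    return alpha / rank


scaling = lora_scaling(adapter_config)
print(scaling)  # 2.0
```

With these values the adapter's update is weighted twice as strongly as the raw product of the two low-rank matrices.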
split-1/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a19df5544dbd6a7b902472baa837ea6e533ec33a161e46b57ea454cdd3210b0
+ size 319876032
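
As a rough sanity check on the adapter size, the sketch below estimates how many LoRA parameters the `target_modules` and `r` from `adapter_config.json` would produce. The layer count, hidden size, and MLP width are assumptions about the base model and do not appear in this diff; so are the fp32 storage and the full-width `k_proj`/`v_proj` projections.

```python
# ASSUMED base-model dimensions (not stated anywhere in this commit).
HIDDEN = 4096        # assumed hidden size
MLP_WIDTH = 11008    # assumed MLP intermediate size
NUM_LAYERS = 32      # assumed number of transformer layers

RANK = 32            # "r" from adapter_config.json
BYTES_PER_PARAM = 4  # assumes weights are stored as fp32

# (in_features, out_features) per targeted module, under the assumptions above.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_WIDTH),
    "up_proj": (HIDDEN, MLP_WIDTH),
    "down_proj": (MLP_WIDTH, HIDDEN),
}

# Each module gets A (r x in) and B (out x r), i.e. r * (in + out) parameters.
params_per_layer = sum(RANK * (i + o) for i, o in module_shapes.values())
total_params = params_per_layer * NUM_LAYERS
total_bytes = total_params * BYTES_PER_PARAM

print(total_params, total_bytes)
```

If those assumptions hold, the estimate comes to 319815680 bytes of tensor data, slightly under the 319876032 bytes recorded in the LFS pointer; a file header could account for the gap, though the diff does not confirm this.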
split-1/chat_template.jinja ADDED
@@ -0,0 +1,16 @@
+ {% set has_system = messages|selectattr('role', 'equalto', 'system')|list|length > 0 %}{% if not has_system %}{{ '<|im_start|>system
+ You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is December 2024, and your model weights are available at https://huggingface.co/allenai.<|im_end|>
+ ' }}{% endif %}{% for message in messages %}{% if message['role'] == 'system' %}{{ '<|im_start|>system
+ ' + message['content'] }}{% if message.get('functions', none) is not none %}{{ ' <functions>' + message['functions'] + '</functions><|im_end|>
+ ' }}{% else %}{{ ' You do not currently have access to any functions. <functions></functions><|im_end|>
+ ' }}{% endif %}{% elif message['role'] == 'user' %}{% if message.get('functions', none) is not none %}{{ '<|im_start|>user
+ ' + message['content'] + '
+ ' + '<functions>' + message['functions'] + '</functions><|im_end|>
+ ' }}{% else %}{{ '<|im_start|>user
+ ' + message['content'] + '<|im_end|>
+ ' }}{% endif %}{% elif message['role'] == 'assistant' %}{{ '<|im_start|>assistant
+ ' }}{% if message.get('content', none) is not none %}{{ message['content'] }}{% endif %}{% if message.get('function_calls', none) is not none %}{{ '<function_calls>' + message['function_calls'] + '</function_calls>' }}{% endif %}{% if not loop.last %}{{ '<|im_end|>' + '
+ ' }}{% else %}{{ eos_token }}{% endif %}{% elif message['role'] == 'environment' %}{{ '<|im_start|>environment
+ ' + message['content'] + '<|im_end|>
+ ' }}{% endif %}{% if loop.last and add_generation_prompt %}{{ '<|im_start|>assistant
+ <think>' }}{% endif %}{% endfor %}
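
The template above is dense, so a hand-written Python translation may help in reading it. This is a sketch of the same branching logic, not the code that `transformers` runs; `render_chat` and `DEFAULT_SYSTEM` are hypothetical names, and the default `eos_token` is taken from `special_tokens_map.json` in this commit.

```python
DEFAULT_SYSTEM = (
    "You are Olmo, a helpful AI assistant built by Ai2. Your date cutoff is "
    "December 2024, and your model weights are available at "
    "https://huggingface.co/allenai."
)


def render_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Mirror the branches of chat_template.jinja for a list of message dicts."""
    parts = []
    # A default system turn is emitted only when no system message is present.
    if not any(m["role"] == "system" for m in messages):
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
    last = len(messages) - 1
    for i, m in enumerate(messages):
        role, functions = m["role"], m.get("functions")
        if role == "system":
            parts.append("<|im_start|>system\n" + m["content"])
            if functions is not None:
                parts.append(" <functions>" + functions + "</functions><|im_end|>\n")
            else:
                parts.append(
                    " You do not currently have access to any functions."
                    " <functions></functions><|im_end|>\n"
                )
        elif role == "user":
            body = "<|im_start|>user\n" + m["content"]
            if functions is not None:
                body += "\n<functions>" + functions + "</functions>"
            parts.append(body + "<|im_end|>\n")
        elif role == "assistant":
            parts.append("<|im_start|>assistant\n")
            if m.get("content") is not None:
                parts.append(m["content"])
            if m.get("function_calls") is not None:
                parts.append(
                    "<function_calls>" + m["function_calls"] + "</function_calls>"
                )
            # The final assistant turn ends with EOS rather than <|im_end|>.
            parts.append(eos_token if i == last else "<|im_end|>\n")
        elif role == "environment":
            parts.append("<|im_start|>environment\n" + m["content"] + "<|im_end|>\n")
        if i == last and add_generation_prompt:
            # The generation prompt opens a <think> block for the model.
            parts.append("<|im_start|>assistant\n<think>")
    return "".join(parts)


example = render_chat([{"role": "user", "content": "Hi"}], add_generation_prompt=True)
print(example)
```

Two details stand out in this reading: the generation prompt always opens a `<think>` block, and a trailing assistant message is closed with the EOS token instead of `<|im_end|>`.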
split-1/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
split-1/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0bb21e7e2e58e6bade45d510ff17893918d5549af5c5f6c003ce58fb0497475a
+ size 6981
split-1/special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|pad|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
split-1/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
split-1/tokenizer_config.json ADDED
@@ -0,0 +1,189 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "100256": {
5
+ "content": "<|extra_id_0|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": false
11
+ },
12
+ "100257": {
13
+ "content": "<|endoftext|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "100258": {
21
+ "content": "<|fim_prefix|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "100259": {
29
+ "content": "<|fim_middle|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "100260": {
37
+ "content": "<|fim_suffix|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "100261": {
45
+ "content": "|||PHONE_NUMBER|||",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": false
51
+ },
52
+ "100262": {
53
+ "content": "|||EMAIL_ADDRESS|||",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": false
59
+ },
60
+ "100263": {
61
+ "content": "|||IP_ADDRESS|||",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": false
67
+ },
68
+ "100264": {
69
+ "content": "<|im_start|>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "100265": {
77
+ "content": "<|im_end|>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "100266": {
85
+ "content": "<|extra_id_1|>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": false
91
+ },
92
+ "100267": {
93
+ "content": "<|extra_id_2|>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": false
99
+ },
100
+ "100268": {
101
+ "content": "<|extra_id_3|>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": false
107
+ },
108
+ "100269": {
109
+ "content": "<|extra_id_4|>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": false
115
+ },
116
+ "100270": {
117
+ "content": "<|extra_id_5|>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "100271": {
125
+ "content": "<|extra_id_6|>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "100272": {
133
+ "content": "<|extra_id_7|>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "100273": {
141
+ "content": "<|extra_id_8|>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "100274": {
149
+ "content": "<|extra_id_9|>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "100275": {
157
+ "content": "<|extra_id_10|>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "100276": {
165
+ "content": "<|endofprompt|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "100277": {
173
+ "content": "<|pad|>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": true
179
+ }
180
+ },
181
+ "bos_token": null,
182
+ "clean_up_tokenization_spaces": false,
183
+ "eos_token": "<|endoftext|>",
184
+ "extra_special_tokens": {},
185
+ "model_max_length": 65536,
186
+ "pad_token": "<|pad|>",
187
+ "tokenizer_class": "GPT2Tokenizer",
188
+ "unk_token": "<|endoftext|>"
189
+ }
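
The tokenizer config names its special tokens by string, while `added_tokens_decoder` maps token IDs to strings. The sketch below joins the two for the tokens used by the chat template. The dictionaries are hand-copied subsets of the file above, and `resolve_token_id` is a hypothetical helper rather than a `transformers` API.

```python
# Hand-copied subset of added_tokens_decoder from tokenizer_config.json.
added_tokens_decoder = {
    "100257": {"content": "<|endoftext|>", "special": True},
    "100264": {"content": "<|im_start|>", "special": True},
    "100265": {"content": "<|im_end|>", "special": True},
    "100277": {"content": "<|pad|>", "special": True},
}

# Named special tokens from the same file.
named_tokens = {
    "bos_token": None,
    "eos_token": "<|endoftext|>",
    "pad_token": "<|pad|>",
    "unk_token": "<|endoftext|>",
}

# Invert the decoder table: token string -> integer ID.
content_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in added_tokens_decoder.items()
}


def resolve_token_id(name: str):
    """Return the integer ID for a named special token, or None if unset."""
    token = named_tokens[name]
    return None if token is None else content_to_id[token]


for name in named_tokens:
    print(name, resolve_token_id(name))
```

Note that `eos_token` and `unk_token` share one ID here, and no BOS token is defined.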
split-1/trainer_state.json ADDED
@@ -0,0 +1,1523 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 1.0,
6
+ "eval_steps": 500,
7
+ "global_step": 148,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "entropy": 0.8085902333259583,
14
+ "epoch": 0.006756756756756757,
15
+ "grad_norm": 0.14047187566757202,
16
+ "learning_rate": 0.0,
17
+ "loss": 1.0639,
18
+ "mean_token_accuracy": 0.698029100894928,
19
+ "num_tokens": 520290.0,
20
+ "step": 1
21
+ },
22
+ {
23
+ "entropy": 0.839566707611084,
24
+ "epoch": 0.013513513513513514,
25
+ "grad_norm": 0.1440640240907669,
26
+ "learning_rate": 4e-05,
27
+ "loss": 1.0952,
28
+ "mean_token_accuracy": 0.687626302242279,
29
+ "num_tokens": 1042076.0,
30
+ "step": 2
31
+ },
32
+ {
33
+ "entropy": 0.8076200485229492,
34
+ "epoch": 0.02027027027027027,
35
+ "grad_norm": 0.13277006149291992,
36
+ "learning_rate": 8e-05,
37
+ "loss": 1.0518,
38
+ "mean_token_accuracy": 0.703219473361969,
39
+ "num_tokens": 1561230.0,
40
+ "step": 3
41
+ },
42
+ {
43
+ "entropy": 0.8777214884757996,
44
+ "epoch": 0.02702702702702703,
45
+ "grad_norm": 0.11633922904729843,
46
+ "learning_rate": 0.00012,
47
+ "loss": 1.0688,
48
+ "mean_token_accuracy": 0.6899521946907043,
49
+ "num_tokens": 2082559.0,
50
+ "step": 4
51
+ },
52
+ {
53
+ "entropy": 0.9525601863861084,
54
+ "epoch": 0.033783783783783786,
55
+ "grad_norm": 0.11181541532278061,
56
+ "learning_rate": 0.00016,
57
+ "loss": 1.0118,
58
+ "mean_token_accuracy": 0.6984938383102417,
59
+ "num_tokens": 2604188.0,
60
+ "step": 5
61
+ },
62
+ {
63
+ "entropy": 1.030306100845337,
64
+ "epoch": 0.04054054054054054,
65
+ "grad_norm": 0.1037818044424057,
66
+ "learning_rate": 0.0002,
67
+ "loss": 0.9858,
68
+ "mean_token_accuracy": 0.7052151560783386,
69
+ "num_tokens": 3125438.0,
70
+ "step": 6
71
+ },
72
+ {
73
+ "entropy": 1.058103084564209,
74
+ "epoch": 0.0472972972972973,
75
+ "grad_norm": 0.08032543957233429,
76
+ "learning_rate": 0.00019860139860139862,
77
+ "loss": 0.9888,
78
+ "mean_token_accuracy": 0.700404167175293,
79
+ "num_tokens": 3647168.0,
80
+ "step": 7
81
+ },
82
+ {
83
+ "entropy": 1.0215051174163818,
84
+ "epoch": 0.05405405405405406,
85
+ "grad_norm": 0.11070238053798676,
86
+ "learning_rate": 0.0001972027972027972,
87
+ "loss": 0.9369,
88
+ "mean_token_accuracy": 0.7151404619216919,
89
+ "num_tokens": 4167149.0,
90
+ "step": 8
91
+ },
92
+ {
93
+ "entropy": 1.0240428447723389,
94
+ "epoch": 0.060810810810810814,
95
+ "grad_norm": 0.08369743078947067,
96
+ "learning_rate": 0.00019580419580419583,
97
+ "loss": 0.9682,
98
+ "mean_token_accuracy": 0.7065526843070984,
99
+ "num_tokens": 4685442.0,
100
+ "step": 9
101
+ },
102
+ {
103
+ "entropy": 0.9852886199951172,
104
+ "epoch": 0.06756756756756757,
105
+ "grad_norm": 0.045641954988241196,
106
+ "learning_rate": 0.0001944055944055944,
107
+ "loss": 0.9657,
108
+ "mean_token_accuracy": 0.7040433287620544,
109
+ "num_tokens": 5207318.0,
110
+ "step": 10
111
+ },
112
+ {
113
+ "entropy": 0.9090337157249451,
114
+ "epoch": 0.07432432432432433,
115
+ "grad_norm": 0.036005664616823196,
116
+ "learning_rate": 0.000193006993006993,
117
+ "loss": 0.9261,
118
+ "mean_token_accuracy": 0.713391900062561,
119
+ "num_tokens": 5728363.0,
120
+ "step": 11
121
+ },
122
+ {
123
+ "entropy": 0.8860379457473755,
124
+ "epoch": 0.08108108108108109,
125
+ "grad_norm": 0.0332878939807415,
126
+ "learning_rate": 0.00019160839160839161,
127
+ "loss": 0.9255,
128
+ "mean_token_accuracy": 0.7144545912742615,
129
+ "num_tokens": 6249685.0,
130
+ "step": 12
131
+ },
132
+ {
133
+ "entropy": 0.8721570372581482,
134
+ "epoch": 0.08783783783783784,
135
+ "grad_norm": 0.034468844532966614,
136
+ "learning_rate": 0.00019020979020979023,
137
+ "loss": 0.9262,
138
+ "mean_token_accuracy": 0.7132186889648438,
139
+ "num_tokens": 6771073.0,
140
+ "step": 13
141
+ },
142
+ {
143
+ "entropy": 0.8796703815460205,
144
+ "epoch": 0.0945945945945946,
145
+ "grad_norm": 0.03204338252544403,
146
+ "learning_rate": 0.00018881118881118882,
147
+ "loss": 0.9312,
148
+ "mean_token_accuracy": 0.7109803557395935,
149
+ "num_tokens": 7292911.0,
150
+ "step": 14
151
+ },
152
+ {
153
+ "entropy": 0.898871660232544,
154
+ "epoch": 0.10135135135135136,
155
+ "grad_norm": 0.02886817790567875,
156
+ "learning_rate": 0.00018741258741258743,
157
+ "loss": 0.9331,
158
+ "mean_token_accuracy": 0.7116541266441345,
159
+ "num_tokens": 7808250.0,
160
+ "step": 15
161
+ },
162
+ {
163
+ "entropy": 0.8994000554084778,
164
+ "epoch": 0.10810810810810811,
165
+ "grad_norm": 0.021798841655254364,
166
+ "learning_rate": 0.00018601398601398602,
167
+ "loss": 0.9064,
168
+ "mean_token_accuracy": 0.717405378818512,
169
+ "num_tokens": 8327930.0,
170
+ "step": 16
171
+ },
172
+ {
173
+ "entropy": 0.9358035922050476,
174
+ "epoch": 0.11486486486486487,
175
+ "grad_norm": 0.020242631435394287,
176
+ "learning_rate": 0.00018461538461538463,
177
+ "loss": 0.9271,
178
+ "mean_token_accuracy": 0.7115734815597534,
179
+ "num_tokens": 8843022.0,
180
+ "step": 17
181
+ },
182
+ {
183
+ "entropy": 0.9195488095283508,
184
+ "epoch": 0.12162162162162163,
185
+ "grad_norm": 0.02282623015344143,
186
+ "learning_rate": 0.00018321678321678322,
187
+ "loss": 0.8845,
188
+ "mean_token_accuracy": 0.7234280109405518,
189
+ "num_tokens": 9362231.0,
190
+ "step": 18
191
+ },
192
+ {
193
+ "entropy": 0.9643428325653076,
194
+ "epoch": 0.12837837837837837,
195
+ "grad_norm": 0.02410944364964962,
196
+ "learning_rate": 0.00018181818181818183,
197
+ "loss": 0.924,
198
+ "mean_token_accuracy": 0.7115726470947266,
199
+ "num_tokens": 9883300.0,
200
+ "step": 19
201
+ },
202
+ {
203
+ "entropy": 0.9452087879180908,
204
+ "epoch": 0.13513513513513514,
205
+ "grad_norm": 0.024604445323348045,
206
+ "learning_rate": 0.00018041958041958042,
207
+ "loss": 0.9038,
208
+ "mean_token_accuracy": 0.7173081636428833,
209
+ "num_tokens": 10403398.0,
210
+ "step": 20
211
+ },
212
+ {
213
+ "entropy": 0.9439239501953125,
214
+ "epoch": 0.14189189189189189,
215
+ "grad_norm": 0.022483721375465393,
216
+ "learning_rate": 0.00017902097902097904,
217
+ "loss": 0.9088,
218
+ "mean_token_accuracy": 0.7151318192481995,
219
+ "num_tokens": 10924591.0,
220
+ "step": 21
221
+ },
222
+ {
223
+ "entropy": 0.9263131022453308,
224
+ "epoch": 0.14864864864864866,
225
+ "grad_norm": 0.020426636561751366,
226
+ "learning_rate": 0.00017762237762237762,
227
+ "loss": 0.9073,
228
+ "mean_token_accuracy": 0.7165982127189636,
229
+ "num_tokens": 11445479.0,
230
+ "step": 22
231
+ },
232
+ {
233
+ "entropy": 0.8799407482147217,
234
+ "epoch": 0.1554054054054054,
235
+ "grad_norm": 0.015485940501093864,
236
+ "learning_rate": 0.00017622377622377624,
237
+ "loss": 0.8733,
238
+ "mean_token_accuracy": 0.7245533466339111,
239
+ "num_tokens": 11967017.0,
240
+ "step": 23
241
+ },
242
+ {
243
+ "entropy": 0.8788759708404541,
244
+ "epoch": 0.16216216216216217,
245
+ "grad_norm": 0.017046676948666573,
246
+ "learning_rate": 0.00017482517482517485,
247
+ "loss": 0.8949,
248
+ "mean_token_accuracy": 0.7203764319419861,
249
+ "num_tokens": 12488697.0,
250
+ "step": 24
251
+ },
252
+ {
253
+ "entropy": 0.8779318332672119,
254
+ "epoch": 0.16891891891891891,
255
+ "grad_norm": 0.018766306340694427,
256
+ "learning_rate": 0.0001734265734265734,
257
+ "loss": 0.8983,
258
+ "mean_token_accuracy": 0.7172704935073853,
259
+ "num_tokens": 13009235.0,
260
+ "step": 25
261
+ },
262
+ {
263
+ "entropy": 0.8834930658340454,
264
+ "epoch": 0.17567567567567569,
265
+ "grad_norm": 0.020971953868865967,
266
+ "learning_rate": 0.00017202797202797203,
267
+ "loss": 0.9148,
268
+ "mean_token_accuracy": 0.714013397693634,
269
+ "num_tokens": 13530899.0,
270
+ "step": 26
271
+ },
272
+ {
273
+ "entropy": 0.8656618595123291,
274
+ "epoch": 0.18243243243243243,
275
+ "grad_norm": 0.018672997131943703,
276
+ "learning_rate": 0.00017062937062937064,
277
+ "loss": 0.8888,
278
+ "mean_token_accuracy": 0.7208808064460754,
279
+ "num_tokens": 14052671.0,
280
+ "step": 27
281
+ },
282
+ {
283
+ "entropy": 0.9078769087791443,
284
+ "epoch": 0.1891891891891892,
285
+ "grad_norm": 0.017195429652929306,
286
+ "learning_rate": 0.00016923076923076923,
287
+ "loss": 0.9242,
288
+ "mean_token_accuracy": 0.7125155329704285,
289
+ "num_tokens": 14574400.0,
290
+ "step": 28
291
+ },
292
+ {
293
+ "entropy": 0.9030580520629883,
294
+ "epoch": 0.19594594594594594,
295
+ "grad_norm": 0.01640770211815834,
296
+ "learning_rate": 0.00016783216783216784,
297
+ "loss": 0.9033,
298
+ "mean_token_accuracy": 0.7164167761802673,
299
+ "num_tokens": 15094634.0,
300
+ "step": 29
301
+ },
302
+ {
303
+ "entropy": 0.9503644108772278,
304
+ "epoch": 0.20270270270270271,
305
+ "grad_norm": 0.015921002253890038,
306
+ "learning_rate": 0.00016643356643356646,
307
+ "loss": 0.9454,
308
+ "mean_token_accuracy": 0.7049751877784729,
309
+ "num_tokens": 15616494.0,
310
+ "step": 30
311
+ },
312
+ {
313
+ "entropy": 0.9235352873802185,
314
+ "epoch": 0.20945945945945946,
315
+ "grad_norm": 0.0211192574352026,
316
+ "learning_rate": 0.00016503496503496504,
317
+ "loss": 0.9083,
318
+ "mean_token_accuracy": 0.7148462533950806,
319
+ "num_tokens": 16137586.0,
320
+ "step": 31
321
+ },
322
+ {
323
+ "entropy": 0.9228708744049072,
324
+ "epoch": 0.21621621621621623,
325
+ "grad_norm": 0.016181398183107376,
326
+ "learning_rate": 0.00016363636363636366,
327
+ "loss": 0.9076,
328
+ "mean_token_accuracy": 0.715546190738678,
329
+ "num_tokens": 16658862.0,
330
+ "step": 32
331
+ },
332
+ {
333
+ "entropy": 0.9239456653594971,
334
+ "epoch": 0.22297297297297297,
335
+ "grad_norm": 0.018280163407325745,
336
+ "learning_rate": 0.00016223776223776225,
337
+ "loss": 0.8973,
338
+ "mean_token_accuracy": 0.7179359197616577,
339
+ "num_tokens": 17181392.0,
340
+ "step": 33
341
+ },
342
+ {
343
+ "entropy": 0.9081429243087769,
344
+ "epoch": 0.22972972972972974,
345
+ "grad_norm": 0.01677747257053852,
346
+ "learning_rate": 0.00016083916083916083,
347
+ "loss": 0.89,
348
+ "mean_token_accuracy": 0.7211325764656067,
349
+ "num_tokens": 17701533.0,
350
+ "step": 34
351
+ },
352
+ {
353
+ "entropy": 0.8966501951217651,
354
+ "epoch": 0.23648648648648649,
355
+ "grad_norm": 0.015331600792706013,
356
+ "learning_rate": 0.00015944055944055945,
357
+ "loss": 0.8834,
358
+ "mean_token_accuracy": 0.7213951349258423,
359
+ "num_tokens": 18222829.0,
360
+ "step": 35
361
+ },
362
+ {
363
+ "entropy": 0.8946911096572876,
364
+ "epoch": 0.24324324324324326,
365
+ "grad_norm": 0.015129225328564644,
366
+ "learning_rate": 0.00015804195804195806,
367
+ "loss": 0.8895,
368
+ "mean_token_accuracy": 0.7205337882041931,
369
+ "num_tokens": 18743612.0,
370
+ "step": 36
371
+ },
372
+ {
373
+ "entropy": 0.924943208694458,
374
+ "epoch": 0.25,
375
+ "grad_norm": 0.016918186098337173,
376
+ "learning_rate": 0.00015664335664335665,
377
+ "loss": 0.934,
378
+ "mean_token_accuracy": 0.7084062099456787,
379
+ "num_tokens": 19265066.0,
380
+ "step": 37
381
+ },
382
+ {
383
+ "entropy": 0.8968441486358643,
384
+ "epoch": 0.25675675675675674,
385
+ "grad_norm": 0.01708938553929329,
386
+ "learning_rate": 0.00015524475524475526,
387
+ "loss": 0.9031,
388
+ "mean_token_accuracy": 0.7160695791244507,
389
+ "num_tokens": 19785775.0,
390
+ "step": 38
391
+ },
392
+ {
393
+ "entropy": 0.9021536111831665,
394
+ "epoch": 0.2635135135135135,
395
+ "grad_norm": 0.01632198505103588,
396
+ "learning_rate": 0.00015384615384615385,
397
+ "loss": 0.9106,
398
+ "mean_token_accuracy": 0.7141885161399841,
399
+ "num_tokens": 20306201.0,
400
+ "step": 39
401
+ },
402
+ {
403
+ "entropy": 0.9144896864891052,
404
+ "epoch": 0.2702702702702703,
405
+ "grad_norm": 0.015601382590830326,
406
+ "learning_rate": 0.00015244755244755244,
407
+ "loss": 0.9177,
408
+ "mean_token_accuracy": 0.7121185660362244,
409
+ "num_tokens": 20827655.0,
410
+ "step": 40
411
+ },
412
+ {
413
+ "entropy": 0.905925989151001,
414
+ "epoch": 0.27702702702702703,
415
+ "grad_norm": 0.015409526415169239,
416
+ "learning_rate": 0.00015104895104895105,
417
+ "loss": 0.9078,
418
+ "mean_token_accuracy": 0.7149533033370972,
419
+ "num_tokens": 21347179.0,
420
+ "step": 41
421
+ },
422
+ {
423
+ "entropy": 0.9454245567321777,
424
+ "epoch": 0.28378378378378377,
425
+ "grad_norm": 0.01604871265590191,
426
+ "learning_rate": 0.00014965034965034964,
427
+ "loss": 0.9382,
428
+ "mean_token_accuracy": 0.7070503234863281,
429
+ "num_tokens": 21867430.0,
430
+ "step": 42
431
+ },
432
+ {
433
+ "entropy": 0.8873435258865356,
434
+ "epoch": 0.2905405405405405,
435
+ "grad_norm": 0.01542913168668747,
436
+ "learning_rate": 0.00014825174825174825,
437
+ "loss": 0.8767,
438
+ "mean_token_accuracy": 0.722999632358551,
439
+ "num_tokens": 22386589.0,
440
+ "step": 43
441
+ },
442
+ {
443
+ "entropy": 0.8967331647872925,
444
+ "epoch": 0.2972972972972973,
445
+ "grad_norm": 0.016037292778491974,
446
+ "learning_rate": 0.00014685314685314687,
447
+ "loss": 0.8846,
448
+ "mean_token_accuracy": 0.7215548157691956,
449
+ "num_tokens": 22901934.0,
450
+ "step": 44
451
+ },
452
+ {
453
+ "entropy": 0.8978803753852844,
454
+ "epoch": 0.30405405405405406,
455
+ "grad_norm": 0.015742763876914978,
456
+ "learning_rate": 0.00014545454545454546,
457
+ "loss": 0.8913,
458
+ "mean_token_accuracy": 0.7180163264274597,
459
+ "num_tokens": 23422539.0,
460
+ "step": 45
461
+ },
462
+ {
463
+ "entropy": 0.8771539926528931,
464
+ "epoch": 0.3108108108108108,
465
+ "grad_norm": 0.015965279191732407,
466
+ "learning_rate": 0.00014405594405594407,
467
+ "loss": 0.8759,
468
+ "mean_token_accuracy": 0.7238014936447144,
469
+ "num_tokens": 23943911.0,
470
+ "step": 46
471
+ },
472
+ {
473
+ "entropy": 0.9169310331344604,
474
+ "epoch": 0.31756756756756754,
475
+ "grad_norm": 0.01562552899122238,
476
+ "learning_rate": 0.00014265734265734269,
477
+ "loss": 0.9091,
478
+ "mean_token_accuracy": 0.7138416171073914,
479
+ "num_tokens": 24465171.0,
480
+ "step": 47
481
+ },
482
+ {
483
+ "entropy": 0.9132385849952698,
484
+ "epoch": 0.32432432432432434,
485
+ "grad_norm": 0.015293029136955738,
486
+ "learning_rate": 0.00014125874125874125,
487
+ "loss": 0.9102,
488
+ "mean_token_accuracy": 0.7141794562339783,
489
+ "num_tokens": 24986963.0,
490
+ "step": 48
491
+ },
492
+ {
493
+ "entropy": 0.9064264893531799,
494
+ "epoch": 0.3310810810810811,
495
+ "grad_norm": 0.01573154330253601,
496
+ "learning_rate": 0.00013986013986013986,
497
+ "loss": 0.9104,
498
+ "mean_token_accuracy": 0.7143970727920532,
499
+ "num_tokens": 25507940.0,
500
+ "step": 49
501
+ },
502
+ {
503
+ "entropy": 0.9247837662696838,
504
+ "epoch": 0.33783783783783783,
505
+ "grad_norm": 0.01566915400326252,
506
+ "learning_rate": 0.00013846153846153847,
507
+ "loss": 0.9244,
508
+ "mean_token_accuracy": 0.7100151181221008,
509
+ "num_tokens": 26029633.0,
510
+ "step": 50
511
+ },
512
+ {
513
+ "entropy": 0.8918017148971558,
514
+ "epoch": 0.34459459459459457,
515
+ "grad_norm": 0.016166144981980324,
516
+ "learning_rate": 0.00013706293706293706,
517
+ "loss": 0.8896,
518
+ "mean_token_accuracy": 0.7206509709358215,
519
+ "num_tokens": 26551330.0,
520
+ "step": 51
521
+ },
522
+ {
523
+ "entropy": 0.8915703296661377,
524
+ "epoch": 0.35135135135135137,
525
+ "grad_norm": 0.016258137300610542,
526
+ "learning_rate": 0.00013566433566433568,
527
+ "loss": 0.8838,
528
+ "mean_token_accuracy": 0.7210925817489624,
529
+ "num_tokens": 27071628.0,
530
+ "step": 52
531
+ },
532
+ {
533
+ "entropy": 0.910973310470581,
534
+ "epoch": 0.3581081081081081,
535
+ "grad_norm": 0.015521145425736904,
536
+ "learning_rate": 0.0001342657342657343,
537
+ "loss": 0.9066,
538
+ "mean_token_accuracy": 0.7141618132591248,
539
+ "num_tokens": 27592998.0,
540
+ "step": 53
541
+ },
542
+ {
543
+ "entropy": 0.9083860516548157,
544
+ "epoch": 0.36486486486486486,
545
+ "grad_norm": 0.01575257070362568,
546
+ "learning_rate": 0.00013286713286713288,
547
+ "loss": 0.9014,
548
+ "mean_token_accuracy": 0.7151919603347778,
549
+ "num_tokens": 28115257.0,
550
+ "step": 54
551
+ },
552
+ {
553
+ "entropy": 0.9026190638542175,
554
+ "epoch": 0.3716216216216216,
555
+ "grad_norm": 0.01594236120581627,
556
+ "learning_rate": 0.00013146853146853147,
557
+ "loss": 0.8943,
558
+ "mean_token_accuracy": 0.7175394296646118,
559
+ "num_tokens": 28636147.0,
560
+ "step": 55
561
+ },
562
+ {
563
+ "entropy": 0.8969719409942627,
564
+ "epoch": 0.3783783783783784,
565
+ "grad_norm": 0.016330119222402573,
566
+ "learning_rate": 0.00013006993006993008,
567
+ "loss": 0.8973,
568
+ "mean_token_accuracy": 0.7174832820892334,
569
+ "num_tokens": 29158715.0,
570
+ "step": 56
571
+ },
572
+ {
573
+ "entropy": 0.9341723322868347,
574
+ "epoch": 0.38513513513513514,
575
+ "grad_norm": 0.01603098399937153,
576
+ "learning_rate": 0.00012867132867132867,
577
+ "loss": 0.9331,
578
+ "mean_token_accuracy": 0.7069593667984009,
579
+ "num_tokens": 29681317.0,
580
+ "step": 57
581
+ },
582
+ {
583
+ "entropy": 0.8998703360557556,
584
+ "epoch": 0.3918918918918919,
585
+ "grad_norm": 0.016095977276563644,
586
+ "learning_rate": 0.00012727272727272728,
587
+ "loss": 0.8929,
588
+ "mean_token_accuracy": 0.718587338924408,
589
+ "num_tokens": 30202981.0,
590
+ "step": 58
591
+ },
592
+ {
593
+ "entropy": 0.9057657718658447,
594
+ "epoch": 0.39864864864864863,
595
+ "grad_norm": 0.016768187284469604,
596
+ "learning_rate": 0.00012587412587412587,
597
+ "loss": 0.9004,
598
+ "mean_token_accuracy": 0.7168737649917603,
599
+ "num_tokens": 30717612.0,
600
+ "step": 59
601
+ },
602
+ {
603
+ "entropy": 0.898177981376648,
604
+ "epoch": 0.40540540540540543,
605
+ "grad_norm": 0.01631537266075611,
606
+ "learning_rate": 0.00012447552447552448,
607
+ "loss": 0.8949,
608
+ "mean_token_accuracy": 0.7172833681106567,
609
+ "num_tokens": 31239946.0,
610
+ "step": 60
611
+ },
612
+ {
613
+ "entropy": 0.8847284317016602,
614
+ "epoch": 0.41216216216216217,
615
+ "grad_norm": 0.01665564626455307,
616
+ "learning_rate": 0.0001230769230769231,
617
+ "loss": 0.88,
618
+ "mean_token_accuracy": 0.7212061285972595,
619
+ "num_tokens": 31761778.0,
620
+ "step": 61
621
+ },
622
+ {
623
+ "entropy": 0.8945165872573853,
624
+ "epoch": 0.4189189189189189,
625
+ "grad_norm": 0.01744748093187809,
626
+ "learning_rate": 0.0001216783216783217,
627
+ "loss": 0.893,
628
+ "mean_token_accuracy": 0.7190291285514832,
629
+ "num_tokens": 32284035.0,
630
+ "step": 62
631
+ },
632
+ {
633
+ "entropy": 0.9090993404388428,
634
+ "epoch": 0.42567567567567566,
635
+ "grad_norm": 0.017030267044901848,
636
+ "learning_rate": 0.00012027972027972027,
637
+ "loss": 0.9082,
638
+ "mean_token_accuracy": 0.7141250967979431,
639
+ "num_tokens": 32804596.0,
640
+ "step": 63
641
+ },
642
+ {
643
+ "entropy": 0.9153857231140137,
644
+ "epoch": 0.43243243243243246,
645
+ "grad_norm": 0.016640154644846916,
646
+ "learning_rate": 0.00011888111888111889,
647
+ "loss": 0.9149,
648
+ "mean_token_accuracy": 0.7134872078895569,
649
+ "num_tokens": 33325570.0,
650
+ "step": 64
651
+ },
652
+ {
653
+ "entropy": 0.8797224164009094,
654
+ "epoch": 0.4391891891891892,
655
+ "grad_norm": 0.01646554283797741,
656
+ "learning_rate": 0.00011748251748251749,
657
+ "loss": 0.8781,
658
+ "mean_token_accuracy": 0.7216721177101135,
659
+ "num_tokens": 33846540.0,
660
+ "step": 65
661
+ },
662
+ {
663
+ "entropy": 0.8971645832061768,
664
+ "epoch": 0.44594594594594594,
665
+ "grad_norm": 0.016337089240550995,
666
+ "learning_rate": 0.00011608391608391609,
667
+ "loss": 0.8935,
668
+ "mean_token_accuracy": 0.7179019451141357,
669
+ "num_tokens": 34366891.0,
670
+ "step": 66
671
+ },
672
+ {
673
+ "entropy": 0.904381513595581,
674
+ "epoch": 0.4527027027027027,
675
+ "grad_norm": 0.017709996551275253,
676
+ "learning_rate": 0.00011468531468531469,
677
+ "loss": 0.8984,
678
+ "mean_token_accuracy": 0.7170865535736084,
679
+ "num_tokens": 34889245.0,
680
+ "step": 67
681
+ },
682
+ {
683
+ "entropy": 0.9063211679458618,
684
+ "epoch": 0.4594594594594595,
685
+ "grad_norm": 0.017201313748955727,
686
+ "learning_rate": 0.0001132867132867133,
687
+ "loss": 0.9015,
688
+ "mean_token_accuracy": 0.7147793173789978,
689
+ "num_tokens": 35404556.0,
690
+ "step": 68
691
+ },
692
+ {
693
+ "entropy": 0.8922293186187744,
694
+ "epoch": 0.46621621621621623,
695
+ "grad_norm": 0.016904350370168686,
696
+ "learning_rate": 0.0001118881118881119,
697
+ "loss": 0.888,
698
+ "mean_token_accuracy": 0.719697892665863,
699
+ "num_tokens": 35925065.0,
700
+ "step": 69
701
+ },
702
+ {
703
+ "entropy": 0.9034209251403809,
704
+ "epoch": 0.47297297297297297,
705
+ "grad_norm": 0.017079392448067665,
706
+ "learning_rate": 0.00011048951048951048,
707
+ "loss": 0.896,
708
+ "mean_token_accuracy": 0.7172802686691284,
709
+ "num_tokens": 36446132.0,
710
+ "step": 70
711
+ },
712
+ {
713
+ "entropy": 0.887146532535553,
714
+ "epoch": 0.4797297297297297,
715
+ "grad_norm": 0.01738973893225193,
716
+ "learning_rate": 0.00010909090909090909,
717
+ "loss": 0.8828,
718
+ "mean_token_accuracy": 0.7213963270187378,
719
+ "num_tokens": 36967214.0,
720
+ "step": 71
721
+ },
722
+ {
723
+ "entropy": 0.8868647813796997,
724
+ "epoch": 0.4864864864864865,
725
+ "grad_norm": 0.017861152067780495,
726
+ "learning_rate": 0.0001076923076923077,
727
+ "loss": 0.8773,
728
+ "mean_token_accuracy": 0.7218159437179565,
729
+ "num_tokens": 37488377.0,
730
+ "step": 72
731
+ },
732
+ {
733
+ "entropy": 0.9134626388549805,
734
+ "epoch": 0.49324324324324326,
735
+ "grad_norm": 0.017511827871203423,
736
+ "learning_rate": 0.0001062937062937063,
737
+ "loss": 0.9122,
738
+ "mean_token_accuracy": 0.7130904793739319,
739
+ "num_tokens": 38008473.0,
740
+ "step": 73
741
+ },
742
+ {
743
+ "entropy": 0.8864673376083374,
744
+ "epoch": 0.5,
745
+ "grad_norm": 0.017983395606279373,
746
+ "learning_rate": 0.0001048951048951049,
747
+ "loss": 0.8881,
748
+ "mean_token_accuracy": 0.7188193798065186,
749
+ "num_tokens": 38529415.0,
750
+ "step": 74
751
+ },
752
+ {
753
+ "entropy": 0.9193586111068726,
754
+ "epoch": 0.5067567567567568,
755
+ "grad_norm": 0.018130991607904434,
756
+ "learning_rate": 0.00010349650349650351,
757
+ "loss": 0.9259,
758
+ "mean_token_accuracy": 0.7099359631538391,
759
+ "num_tokens": 39049712.0,
760
+ "step": 75
761
+ },
762
+ {
763
+ "entropy": 0.8895922303199768,
764
+ "epoch": 0.5135135135135135,
765
+ "grad_norm": 0.017333725467324257,
766
+ "learning_rate": 0.00010209790209790211,
767
+ "loss": 0.8891,
768
+ "mean_token_accuracy": 0.7198699116706848,
769
+ "num_tokens": 39566565.0,
770
+ "step": 76
771
+ },
772
+ {
773
+ "entropy": 0.8800663948059082,
774
+ "epoch": 0.5202702702702703,
775
+ "grad_norm": 0.018249373883008957,
776
+ "learning_rate": 0.00010069930069930071,
777
+ "loss": 0.8692,
778
+ "mean_token_accuracy": 0.724394679069519,
779
+ "num_tokens": 40086868.0,
780
+ "step": 77
781
+ },
782
+ {
783
+ "entropy": 0.8940162658691406,
784
+ "epoch": 0.527027027027027,
785
+ "grad_norm": 0.01744958944618702,
786
+ "learning_rate": 9.930069930069931e-05,
787
+ "loss": 0.8898,
788
+ "mean_token_accuracy": 0.7197460532188416,
789
+ "num_tokens": 40607334.0,
790
+ "step": 78
791
+ },
792
+ {
793
+ "entropy": 0.8987851142883301,
794
+ "epoch": 0.5337837837837838,
795
+ "grad_norm": 0.018518365919589996,
796
+ "learning_rate": 9.790209790209791e-05,
797
+ "loss": 0.8904,
798
+ "mean_token_accuracy": 0.7207316160202026,
799
+ "num_tokens": 41126753.0,
800
+ "step": 79
801
+ },
802
+ {
803
+ "entropy": 0.8804575204849243,
804
+ "epoch": 0.5405405405405406,
805
+ "grad_norm": 0.018223201856017113,
806
+ "learning_rate": 9.65034965034965e-05,
807
+ "loss": 0.8763,
808
+ "mean_token_accuracy": 0.7228670120239258,
809
+ "num_tokens": 41647670.0,
810
+ "step": 80
811
+ },
812
+ {
813
+ "entropy": 0.8889448046684265,
814
+ "epoch": 0.5472972972972973,
815
+ "grad_norm": 0.018730709329247475,
816
+ "learning_rate": 9.510489510489511e-05,
817
+ "loss": 0.883,
818
+ "mean_token_accuracy": 0.7205345630645752,
819
+ "num_tokens": 42166183.0,
820
+ "step": 81
821
+ },
822
+ {
823
+ "entropy": 0.8915292024612427,
824
+ "epoch": 0.5540540540540541,
825
+ "grad_norm": 0.018218854442238808,
826
+ "learning_rate": 9.370629370629372e-05,
827
+ "loss": 0.8835,
828
+ "mean_token_accuracy": 0.7211962342262268,
829
+ "num_tokens": 42685978.0,
830
+ "step": 82
831
+ },
832
+ {
833
+ "entropy": 0.871468186378479,
834
+ "epoch": 0.5608108108108109,
835
+ "grad_norm": 0.0187361016869545,
836
+ "learning_rate": 9.230769230769232e-05,
837
+ "loss": 0.8697,
838
+ "mean_token_accuracy": 0.7244738340377808,
839
+ "num_tokens": 43203743.0,
840
+ "step": 83
841
+ },
842
+ {
843
+ "entropy": 0.8702860474586487,
844
+ "epoch": 0.5675675675675675,
845
+ "grad_norm": 0.018368471413850784,
846
+ "learning_rate": 9.090909090909092e-05,
847
+ "loss": 0.8698,
848
+ "mean_token_accuracy": 0.7259374260902405,
849
+ "num_tokens": 43725363.0,
850
+ "step": 84
851
+ },
852
+ {
853
+ "entropy": 0.8703951239585876,
854
+ "epoch": 0.5743243243243243,
855
+ "grad_norm": 0.01838189922273159,
856
+ "learning_rate": 8.951048951048952e-05,
857
+ "loss": 0.8743,
858
+ "mean_token_accuracy": 0.7234249711036682,
859
+ "num_tokens": 44246559.0,
860
+ "step": 85
861
+ },
862
+ {
863
+ "entropy": 0.8820457458496094,
864
+ "epoch": 0.581081081081081,
865
+ "grad_norm": 0.019160225987434387,
866
+ "learning_rate": 8.811188811188812e-05,
867
+ "loss": 0.8849,
868
+ "mean_token_accuracy": 0.7206857800483704,
869
+ "num_tokens": 44769062.0,
870
+ "step": 86
871
+ },
872
+ {
873
+ "entropy": 0.9152972102165222,
874
+ "epoch": 0.5878378378378378,
875
+ "grad_norm": 0.019004985690116882,
876
+ "learning_rate": 8.67132867132867e-05,
877
+ "loss": 0.9153,
878
+ "mean_token_accuracy": 0.7120697498321533,
879
+ "num_tokens": 45286787.0,
880
+ "step": 87
881
+ },
882
+ {
883
+ "entropy": 0.904834508895874,
884
+ "epoch": 0.5945945945945946,
885
+ "grad_norm": 0.018431641161441803,
886
+ "learning_rate": 8.531468531468532e-05,
887
+ "loss": 0.9011,
888
+ "mean_token_accuracy": 0.7159322500228882,
889
+ "num_tokens": 45807531.0,
890
+ "step": 88
891
+ },
892
+ {
893
+ "entropy": 0.9021150469779968,
894
+ "epoch": 0.6013513513513513,
895
+ "grad_norm": 0.01898609660565853,
896
+ "learning_rate": 8.391608391608392e-05,
897
+ "loss": 0.8956,
898
+ "mean_token_accuracy": 0.7188680768013,
899
+ "num_tokens": 46321183.0,
900
+ "step": 89
901
+ },
902
+ {
903
+ "entropy": 0.9032172560691833,
904
+ "epoch": 0.6081081081081081,
905
+ "grad_norm": 0.02004328928887844,
906
+ "learning_rate": 8.251748251748252e-05,
907
+ "loss": 0.8928,
908
+ "mean_token_accuracy": 0.7190265655517578,
909
+ "num_tokens": 46841033.0,
910
+ "step": 90
911
+ },
912
+ {
913
+ "entropy": 0.9000004529953003,
914
+ "epoch": 0.6148648648648649,
915
+ "grad_norm": 0.019782939925789833,
916
+ "learning_rate": 8.111888111888112e-05,
917
+ "loss": 0.8849,
918
+ "mean_token_accuracy": 0.7195360064506531,
919
+ "num_tokens": 47363499.0,
920
+ "step": 91
921
+ },
922
+ {
923
+ "entropy": 0.8834792375564575,
924
+ "epoch": 0.6216216216216216,
925
+ "grad_norm": 0.0185946486890316,
926
+ "learning_rate": 7.972027972027972e-05,
927
+ "loss": 0.8762,
928
+ "mean_token_accuracy": 0.723339319229126,
929
+ "num_tokens": 47884207.0,
930
+ "step": 92
931
+ },
932
+ {
933
+ "entropy": 0.9149696826934814,
934
+ "epoch": 0.6283783783783784,
935
+ "grad_norm": 0.018683424219489098,
936
+ "learning_rate": 7.832167832167832e-05,
937
+ "loss": 0.9166,
938
+ "mean_token_accuracy": 0.7118301391601562,
939
+ "num_tokens": 48405854.0,
940
+ "step": 93
941
+ },
942
+ {
943
+ "entropy": 0.8827645182609558,
944
+ "epoch": 0.6351351351351351,
945
+ "grad_norm": 0.02002580091357231,
946
+ "learning_rate": 7.692307692307693e-05,
947
+ "loss": 0.8823,
948
+ "mean_token_accuracy": 0.7205896377563477,
949
+ "num_tokens": 48923152.0,
950
+ "step": 94
951
+ },
952
+ {
953
+ "entropy": 0.8913484811782837,
954
+ "epoch": 0.6418918918918919,
955
+ "grad_norm": 0.01915843039751053,
956
+ "learning_rate": 7.552447552447553e-05,
957
+ "loss": 0.8938,
958
+ "mean_token_accuracy": 0.7178173065185547,
959
+ "num_tokens": 49445126.0,
960
+ "step": 95
961
+ },
962
+ {
963
+ "entropy": 0.8866020441055298,
964
+ "epoch": 0.6486486486486487,
965
+ "grad_norm": 0.020832480862736702,
966
+ "learning_rate": 7.412587412587413e-05,
967
+ "loss": 0.8917,
968
+ "mean_token_accuracy": 0.7187622785568237,
969
+ "num_tokens": 49967007.0,
970
+ "step": 96
971
+ },
972
+ {
973
+ "entropy": 0.8766802549362183,
974
+ "epoch": 0.6554054054054054,
975
+ "grad_norm": 0.019703548401594162,
976
+ "learning_rate": 7.272727272727273e-05,
977
+ "loss": 0.8714,
978
+ "mean_token_accuracy": 0.7235036492347717,
979
+ "num_tokens": 50488513.0,
980
+ "step": 97
981
+ },
982
+ {
983
+ "entropy": 0.9040693044662476,
984
+ "epoch": 0.6621621621621622,
985
+ "grad_norm": 0.0192877184599638,
986
+ "learning_rate": 7.132867132867134e-05,
987
+ "loss": 0.9035,
988
+ "mean_token_accuracy": 0.7158621549606323,
989
+ "num_tokens": 51008094.0,
990
+ "step": 98
991
+ },
992
+ {
993
+ "entropy": 0.8829696178436279,
994
+ "epoch": 0.668918918918919,
995
+ "grad_norm": 0.01927708089351654,
996
+ "learning_rate": 6.993006993006993e-05,
997
+ "loss": 0.8797,
998
+ "mean_token_accuracy": 0.7219703793525696,
999
+ "num_tokens": 51529469.0,
1000
+ "step": 99
1001
+ },
1002
+ {
1003
+ "entropy": 0.894615650177002,
1004
+ "epoch": 0.6756756756756757,
1005
+ "grad_norm": 0.01965499296784401,
1006
+ "learning_rate": 6.853146853146853e-05,
1007
+ "loss": 0.8882,
1008
+ "mean_token_accuracy": 0.7190660238265991,
1009
+ "num_tokens": 52050576.0,
1010
+ "step": 100
1011
+ },
1012
+ {
1013
+ "entropy": 0.8754645586013794,
1014
+ "epoch": 0.6824324324324325,
1015
+ "grad_norm": 0.019854635000228882,
1016
+ "learning_rate": 6.713286713286715e-05,
1017
+ "loss": 0.869,
1018
+ "mean_token_accuracy": 0.72502201795578,
1019
+ "num_tokens": 52571347.0,
1020
+ "step": 101
1021
+ },
1022
+ {
1023
+ "entropy": 0.8882539868354797,
1024
+ "epoch": 0.6891891891891891,
1025
+ "grad_norm": 0.020126935094594955,
1026
+ "learning_rate": 6.573426573426573e-05,
1027
+ "loss": 0.8766,
1028
+ "mean_token_accuracy": 0.7229670882225037,
1029
+ "num_tokens": 53092933.0,
1030
+ "step": 102
1031
+ },
1032
+ {
1033
+ "entropy": 0.9050229787826538,
1034
+ "epoch": 0.6959459459459459,
1035
+ "grad_norm": 0.019794149324297905,
1036
+ "learning_rate": 6.433566433566433e-05,
1037
+ "loss": 0.8965,
1038
+ "mean_token_accuracy": 0.7169359922409058,
1039
+ "num_tokens": 53614658.0,
1040
+ "step": 103
1041
+ },
1042
+ {
1043
+ "entropy": 0.900147557258606,
1044
+ "epoch": 0.7027027027027027,
1045
+ "grad_norm": 0.01930818147957325,
1046
+ "learning_rate": 6.293706293706293e-05,
1047
+ "loss": 0.8975,
1048
+ "mean_token_accuracy": 0.716631293296814,
1049
+ "num_tokens": 54134578.0,
1050
+ "step": 104
1051
+ },
1052
+ {
1053
+ "entropy": 0.8568655252456665,
1054
+ "epoch": 0.7094594594594594,
1055
+ "grad_norm": 0.019813908264040947,
1056
+ "learning_rate": 6.153846153846155e-05,
1057
+ "loss": 0.8549,
1058
+ "mean_token_accuracy": 0.7281835079193115,
1059
+ "num_tokens": 54656493.0,
1060
+ "step": 105
1061
+ },
1062
+ {
1063
+ "entropy": 0.8811562061309814,
1064
+ "epoch": 0.7162162162162162,
1065
+ "grad_norm": 0.02051232010126114,
1066
+ "learning_rate": 6.0139860139860136e-05,
1067
+ "loss": 0.8798,
1068
+ "mean_token_accuracy": 0.7215043902397156,
1069
+ "num_tokens": 55177423.0,
1070
+ "step": 106
1071
+ },
1072
+ {
1073
+ "entropy": 0.8680734634399414,
1074
+ "epoch": 0.722972972972973,
1075
+ "grad_norm": 0.02060469426214695,
1076
+ "learning_rate": 5.8741258741258744e-05,
1077
+ "loss": 0.8739,
1078
+ "mean_token_accuracy": 0.7240718603134155,
1079
+ "num_tokens": 55698753.0,
1080
+ "step": 107
1081
+ },
1082
+ {
1083
+ "entropy": 0.8927019238471985,
1084
+ "epoch": 0.7297297297297297,
1085
+ "grad_norm": 0.020770812407135963,
1086
+ "learning_rate": 5.7342657342657345e-05,
1087
+ "loss": 0.8931,
1088
+ "mean_token_accuracy": 0.7185073494911194,
1089
+ "num_tokens": 56208445.0,
1090
+ "step": 108
1091
+ },
1092
+ {
1093
+ "entropy": 0.8723157644271851,
1094
+ "epoch": 0.7364864864864865,
1095
+ "grad_norm": 0.020027851685881615,
1096
+ "learning_rate": 5.594405594405595e-05,
1097
+ "loss": 0.8677,
1098
+ "mean_token_accuracy": 0.724934995174408,
1099
+ "num_tokens": 56727266.0,
1100
+ "step": 109
1101
+ },
1102
+ {
1103
+ "entropy": 0.885744571685791,
1104
+ "epoch": 0.7432432432432432,
1105
+ "grad_norm": 0.019678086042404175,
1106
+ "learning_rate": 5.4545454545454546e-05,
1107
+ "loss": 0.8811,
1108
+ "mean_token_accuracy": 0.7209365367889404,
1109
+ "num_tokens": 57248166.0,
1110
+ "step": 110
1111
+ },
1112
+ {
1113
+ "entropy": 0.8876606225967407,
1114
+ "epoch": 0.75,
1115
+ "grad_norm": 0.020123396068811417,
1116
+ "learning_rate": 5.314685314685315e-05,
1117
+ "loss": 0.8834,
1118
+ "mean_token_accuracy": 0.7213138937950134,
1119
+ "num_tokens": 57769996.0,
1120
+ "step": 111
1121
+ },
1122
+ {
1123
+ "entropy": 0.9018759727478027,
1124
+ "epoch": 0.7567567567567568,
1125
+ "grad_norm": 0.02047978714108467,
1126
+ "learning_rate": 5.1748251748251755e-05,
1127
+ "loss": 0.8945,
1128
+ "mean_token_accuracy": 0.7177107930183411,
1129
+ "num_tokens": 58291779.0,
1130
+ "step": 112
1131
+ },
1132
+ {
1133
+ "entropy": 0.9049036502838135,
1134
+ "epoch": 0.7635135135135135,
1135
+ "grad_norm": 0.020464390516281128,
1136
+ "learning_rate": 5.0349650349650356e-05,
1137
+ "loss": 0.8984,
1138
+ "mean_token_accuracy": 0.7173320651054382,
1139
+ "num_tokens": 58805386.0,
1140
+ "step": 113
1141
+ },
1142
+ {
1143
+ "entropy": 0.8888832330703735,
1144
+ "epoch": 0.7702702702702703,
1145
+ "grad_norm": 0.01985686831176281,
1146
+ "learning_rate": 4.8951048951048956e-05,
1147
+ "loss": 0.882,
1148
+ "mean_token_accuracy": 0.7208877205848694,
1149
+ "num_tokens": 59324642.0,
1150
+ "step": 114
1151
+ },
1152
+ {
1153
+ "entropy": 0.9020118117332458,
1154
+ "epoch": 0.777027027027027,
1155
+ "grad_norm": 0.020026598125696182,
1156
+ "learning_rate": 4.755244755244756e-05,
1157
+ "loss": 0.8973,
1158
+ "mean_token_accuracy": 0.7182012796401978,
1159
+ "num_tokens": 59847011.0,
1160
+ "step": 115
1161
+ },
1162
+ {
1163
+ "entropy": 0.8986602425575256,
1164
+ "epoch": 0.7837837837837838,
1165
+ "grad_norm": 0.01975986920297146,
1166
+ "learning_rate": 4.615384615384616e-05,
1167
+ "loss": 0.8968,
1168
+ "mean_token_accuracy": 0.7171120047569275,
1169
+ "num_tokens": 60369152.0,
1170
+ "step": 116
1171
+ },
1172
+ {
1173
+ "entropy": 0.8851807117462158,
1174
+ "epoch": 0.7905405405405406,
1175
+ "grad_norm": 0.01993614062666893,
1176
+ "learning_rate": 4.475524475524476e-05,
1177
+ "loss": 0.8804,
1178
+ "mean_token_accuracy": 0.7222026586532593,
1179
+ "num_tokens": 60889245.0,
1180
+ "step": 117
1181
+ },
1182
+ {
1183
+ "entropy": 0.882935643196106,
1184
+ "epoch": 0.7972972972972973,
1185
+ "grad_norm": 0.01959838718175888,
1186
+ "learning_rate": 4.335664335664335e-05,
1187
+ "loss": 0.8797,
1188
+ "mean_token_accuracy": 0.7208013534545898,
1189
+ "num_tokens": 61407350.0,
1190
+ "step": 118
1191
+ },
1192
+ {
1193
+ "entropy": 0.8777122497558594,
1194
+ "epoch": 0.8040540540540541,
1195
+ "grad_norm": 0.0199885256588459,
1196
+ "learning_rate": 4.195804195804196e-05,
1197
+ "loss": 0.8715,
1198
+ "mean_token_accuracy": 0.725073516368866,
1199
+ "num_tokens": 61928742.0,
1200
+ "step": 119
1201
+ },
1202
+ {
1203
+ "entropy": 0.850617527961731,
1204
+ "epoch": 0.8108108108108109,
1205
+ "grad_norm": 0.020385252311825752,
1206
+ "learning_rate": 4.055944055944056e-05,
1207
+ "loss": 0.8499,
1208
+ "mean_token_accuracy": 0.730745792388916,
1209
+ "num_tokens": 62446355.0,
1210
+ "step": 120
1211
+ },
1212
+ {
1213
+ "entropy": 0.8720276355743408,
1214
+ "epoch": 0.8175675675675675,
1215
+ "grad_norm": 0.02067047357559204,
1216
+ "learning_rate": 3.916083916083916e-05,
1217
+ "loss": 0.8746,
1218
+ "mean_token_accuracy": 0.7231326699256897,
1219
+ "num_tokens": 62967935.0,
1220
+ "step": 121
1221
+ },
1222
+ {
1223
+ "entropy": 0.9207990169525146,
1224
+ "epoch": 0.8243243243243243,
1225
+ "grad_norm": 0.02032148465514183,
1226
+ "learning_rate": 3.776223776223776e-05,
1227
+ "loss": 0.922,
1228
+ "mean_token_accuracy": 0.710451066493988,
1229
+ "num_tokens": 63489615.0,
1230
+ "step": 122
1231
+ },
1232
+ {
1233
+ "entropy": 0.9047123193740845,
1234
+ "epoch": 0.831081081081081,
1235
+ "grad_norm": 0.020533205941319466,
1236
+ "learning_rate": 3.6363636363636364e-05,
1237
+ "loss": 0.9015,
1238
+ "mean_token_accuracy": 0.7166460752487183,
1239
+ "num_tokens": 64010947.0,
1240
+ "step": 123
1241
+ },
1242
+ {
1243
+ "entropy": 0.8477683067321777,
1244
+ "epoch": 0.8378378378378378,
1245
+ "grad_norm": 0.019841615110635757,
1246
+ "learning_rate": 3.4965034965034965e-05,
1247
+ "loss": 0.8415,
1248
+ "mean_token_accuracy": 0.7314550876617432,
1249
+ "num_tokens": 64531314.0,
1250
+ "step": 124
1251
+ },
1252
+ {
1253
+ "entropy": 0.8877344131469727,
1254
+ "epoch": 0.8445945945945946,
1255
+ "grad_norm": 0.01969732716679573,
1256
+ "learning_rate": 3.356643356643357e-05,
1257
+ "loss": 0.8858,
1258
+ "mean_token_accuracy": 0.7207072973251343,
1259
+ "num_tokens": 65053203.0,
1260
+ "step": 125
1261
+ },
1262
+ {
1263
+ "entropy": 0.8977670669555664,
1264
+ "epoch": 0.8513513513513513,
1265
+ "grad_norm": 0.01998847909271717,
1266
+ "learning_rate": 3.216783216783217e-05,
1267
+ "loss": 0.8914,
1268
+ "mean_token_accuracy": 0.7174409627914429,
1269
+ "num_tokens": 65574215.0,
1270
+ "step": 126
1271
+ },
1272
+ {
1273
+ "entropy": 0.8922737836837769,
1274
+ "epoch": 0.8581081081081081,
1275
+ "grad_norm": 0.02041775733232498,
1276
+ "learning_rate": 3.0769230769230774e-05,
1277
+ "loss": 0.8918,
1278
+ "mean_token_accuracy": 0.7177229523658752,
1279
+ "num_tokens": 66095688.0,
1280
+ "step": 127
1281
+ },
1282
+ {
1283
+ "entropy": 0.8866901397705078,
1284
+ "epoch": 0.8648648648648649,
1285
+ "grad_norm": 0.02134627476334572,
1286
+ "learning_rate": 2.9370629370629372e-05,
1287
+ "loss": 0.877,
1288
+ "mean_token_accuracy": 0.7219728827476501,
1289
+ "num_tokens": 66608762.0,
1290
+ "step": 128
1291
+ },
1292
+ {
1293
+ "entropy": 0.8762195110321045,
1294
+ "epoch": 0.8716216216216216,
1295
+ "grad_norm": 0.020530981943011284,
1296
+ "learning_rate": 2.7972027972027976e-05,
1297
+ "loss": 0.8705,
1298
+ "mean_token_accuracy": 0.7246843576431274,
1299
+ "num_tokens": 67128373.0,
1300
+ "step": 129
1301
+ },
1302
+ {
1303
+ "entropy": 0.8929407596588135,
1304
+ "epoch": 0.8783783783783784,
1305
+ "grad_norm": 0.021080242469906807,
1306
+ "learning_rate": 2.6573426573426574e-05,
1307
+ "loss": 0.883,
1308
+ "mean_token_accuracy": 0.7201827168464661,
1309
+ "num_tokens": 67645584.0,
1310
+ "step": 130
1311
+ },
1312
+ {
1313
+ "entropy": 0.8947334289550781,
1314
+ "epoch": 0.8851351351351351,
1315
+ "grad_norm": 0.021501585841178894,
1316
+ "learning_rate": 2.5174825174825178e-05,
1317
+ "loss": 0.8867,
1318
+ "mean_token_accuracy": 0.7202873826026917,
1319
+ "num_tokens": 68163799.0,
1320
+ "step": 131
1321
+ },
1322
+ {
1323
+ "entropy": 0.8815573453903198,
1324
+ "epoch": 0.8918918918918919,
1325
+ "grad_norm": 0.02011815272271633,
1326
+ "learning_rate": 2.377622377622378e-05,
1327
+ "loss": 0.8755,
1328
+ "mean_token_accuracy": 0.722142219543457,
1329
+ "num_tokens": 68684635.0,
1330
+ "step": 132
1331
+ },
1332
+ {
1333
+ "entropy": 0.8770313262939453,
1334
+ "epoch": 0.8986486486486487,
1335
+ "grad_norm": 0.021030854433774948,
1336
+ "learning_rate": 2.237762237762238e-05,
1337
+ "loss": 0.8726,
1338
+ "mean_token_accuracy": 0.7231403589248657,
1339
+ "num_tokens": 69205861.0,
1340
+ "step": 133
1341
+ },
1342
+ {
1343
+ "entropy": 0.8650751709938049,
1344
+ "epoch": 0.9054054054054054,
1345
+ "grad_norm": 0.020364264026284218,
1346
+ "learning_rate": 2.097902097902098e-05,
1347
+ "loss": 0.8633,
1348
+ "mean_token_accuracy": 0.7260516881942749,
1349
+ "num_tokens": 69728254.0,
1350
+ "step": 134
1351
+ },
1352
+ {
1353
+ "entropy": 0.9033790230751038,
1354
+ "epoch": 0.9121621621621622,
1355
+ "grad_norm": 0.02128477208316326,
1356
+ "learning_rate": 1.958041958041958e-05,
1357
+ "loss": 0.9083,
1358
+ "mean_token_accuracy": 0.7141423225402832,
1359
+ "num_tokens": 70248446.0,
1360
+ "step": 135
1361
+ },
1362
+ {
1363
+ "entropy": 0.8698260188102722,
1364
+ "epoch": 0.918918918918919,
1365
+ "grad_norm": 0.020461006090044975,
1366
+ "learning_rate": 1.8181818181818182e-05,
1367
+ "loss": 0.8678,
1368
+ "mean_token_accuracy": 0.7250317335128784,
1369
+ "num_tokens": 70768977.0,
1370
+ "step": 136
1371
+ },
1372
+ {
1373
+ "entropy": 0.9101998805999756,
1374
+ "epoch": 0.9256756756756757,
1375
+ "grad_norm": 0.021351408213377,
1376
+ "learning_rate": 1.6783216783216786e-05,
1377
+ "loss": 0.9156,
1378
+ "mean_token_accuracy": 0.7119852304458618,
1379
+ "num_tokens": 71285104.0,
1380
+ "step": 137
1381
+ },
1382
+ {
1383
+ "entropy": 0.8741437196731567,
1384
+ "epoch": 0.9324324324324325,
1385
+ "grad_norm": 0.02133285254240036,
1386
+ "learning_rate": 1.5384615384615387e-05,
1387
+ "loss": 0.8756,
1388
+ "mean_token_accuracy": 0.7225217819213867,
1389
+ "num_tokens": 71806248.0,
1390
+ "step": 138
1391
+ },
1392
+ {
1393
+ "entropy": 0.8736119270324707,
1394
+ "epoch": 0.9391891891891891,
1395
+ "grad_norm": 0.020086556673049927,
1396
+ "learning_rate": 1.3986013986013988e-05,
1397
+ "loss": 0.8683,
1398
+ "mean_token_accuracy": 0.7247164249420166,
1399
+ "num_tokens": 72328759.0,
1400
+ "step": 139
1401
+ },
1402
+ {
1403
+ "entropy": 0.8891340494155884,
1404
+ "epoch": 0.9459459459459459,
1405
+ "grad_norm": 0.02030119113624096,
1406
+ "learning_rate": 1.2587412587412589e-05,
1407
+ "loss": 0.886,
1408
+ "mean_token_accuracy": 0.720429539680481,
1409
+ "num_tokens": 72848818.0,
1410
+ "step": 140
1411
+ },
1412
+ {
1413
+ "entropy": 0.9049081802368164,
1414
+ "epoch": 0.9527027027027027,
1415
+ "grad_norm": 0.020596666261553764,
1416
+ "learning_rate": 1.118881118881119e-05,
1417
+ "loss": 0.9042,
1418
+ "mean_token_accuracy": 0.7161701321601868,
1419
+ "num_tokens": 73370480.0,
1420
+ "step": 141
1421
+ },
1422
+ {
1423
+ "entropy": 0.8795987367630005,
1424
+ "epoch": 0.9594594594594594,
1425
+ "grad_norm": 0.020133303478360176,
1426
+ "learning_rate": 9.79020979020979e-06,
1427
+ "loss": 0.8769,
1428
+ "mean_token_accuracy": 0.722213089466095,
1429
+ "num_tokens": 73892465.0,
1430
+ "step": 142
1431
+ },
1432
+ {
1433
+ "entropy": 0.9042908549308777,
1434
+ "epoch": 0.9662162162162162,
1435
+ "grad_norm": 0.020722530782222748,
1436
+ "learning_rate": 8.391608391608393e-06,
1437
+ "loss": 0.9,
1438
+ "mean_token_accuracy": 0.7148555517196655,
1439
+ "num_tokens": 74407048.0,
1440
+ "step": 143
1441
+ },
1442
+ {
1443
+ "entropy": 0.8909604549407959,
1444
+ "epoch": 0.972972972972973,
1445
+ "grad_norm": 0.020139718428254128,
1446
+ "learning_rate": 6.993006993006994e-06,
1447
+ "loss": 0.8875,
1448
+ "mean_token_accuracy": 0.7199447154998779,
1449
+ "num_tokens": 74927929.0,
1450
+ "step": 144
1451
+ },
1452
+ {
1453
+ "entropy": 0.8854581117630005,
1454
+ "epoch": 0.9797297297297297,
1455
+ "grad_norm": 0.020443160086870193,
1456
+ "learning_rate": 5.594405594405595e-06,
1457
+ "loss": 0.8828,
1458
+ "mean_token_accuracy": 0.7213336825370789,
1459
+ "num_tokens": 75448350.0,
1460
+ "step": 145
1461
+ },
1462
+ {
1463
+ "entropy": 0.9017068147659302,
1464
+ "epoch": 0.9864864864864865,
1465
+ "grad_norm": 0.02036883309483528,
1466
+ "learning_rate": 4.195804195804197e-06,
1467
+ "loss": 0.8969,
1468
+ "mean_token_accuracy": 0.7174678444862366,
1469
+ "num_tokens": 75968294.0,
1470
+ "step": 146
1471
+ },
1472
+ {
1473
+ "entropy": 0.8653884530067444,
1474
+ "epoch": 0.9932432432432432,
1475
+ "grad_norm": 0.020496118813753128,
1476
+ "learning_rate": 2.7972027972027974e-06,
1477
+ "loss": 0.862,
1478
+ "mean_token_accuracy": 0.725719153881073,
1479
+ "num_tokens": 76484227.0,
1480
+ "step": 147
1481
+ },
1482
+ {
1483
+ "entropy": 0.8869370818138123,
1484
+ "epoch": 1.0,
1485
+ "grad_norm": 0.020335717126727104,
1486
+ "learning_rate": 1.3986013986013987e-06,
1487
+ "loss": 0.8832,
1488
+ "mean_token_accuracy": 0.72120600938797,
1489
+ "num_tokens": 77005516.0,
1490
+ "step": 148
1491
+ },
1492
+ {
1493
+ "epoch": 1.0,
1494
+ "step": 148,
1495
+ "total_flos": 3.219089170299355e+18,
1496
+ "train_loss": 0.901587930080053,
1497
+ "train_runtime": 1651.2377,
1498
+ "train_samples_per_second": 5.736,
1499
+ "train_steps_per_second": 0.09
1500
+ }
1501
+ ],
1502
+ "logging_steps": 1,
1503
+ "max_steps": 148,
1504
+ "num_input_tokens_seen": 0,
1505
+ "num_train_epochs": 1,
1506
+ "save_steps": 500,
1507
+ "stateful_callbacks": {
1508
+ "TrainerControl": {
1509
+ "args": {
1510
+ "should_epoch_stop": false,
1511
+ "should_evaluate": false,
1512
+ "should_log": false,
1513
+ "should_save": false,
1514
+ "should_training_stop": false
1515
+ },
1516
+ "attributes": {}
1517
+ }
1518
+ },
1519
+ "total_flos": 3.219089170299355e+18,
1520
+ "train_batch_size": 8,
1521
+ "trial_name": null,
1522
+ "trial_params": null
1523
+ }
split-1/vocab.json ADDED
The diff for this file is too large to render.