slxhere committed on
Commit 992a607 · verified · 1 Parent(s): b350e86

Upload folder using huggingface_hub
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
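The pooling config above enables only mean pooling (`pooling_mode_mean_tokens: true`): token embeddings are averaged with the attention mask so padding is ignored. A minimal NumPy sketch of that mode, with toy token vectors standing in for the model's real 1024-dimensional outputs:

```python
import numpy as np

# Toy token embeddings for one sentence: 4 token slots, dim 3
# (the real model uses word_embedding_dimension=1024).
token_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [5.0, 5.0, 5.0],   # padding slot, masked out below
    [0.0, 0.0, 0.0],   # padding slot
])
attention_mask = np.array([1, 1, 0, 0])

# pooling_mode_mean_tokens: average only the non-padding tokens.
mask = attention_mask[:, None]                  # (tokens, 1)
summed = (token_embeddings * mask).sum(axis=0)  # (dim,)
count = np.clip(mask.sum(), 1e-9, None)         # avoid division by zero
sentence_embedding = summed / count

print(sentence_embedding)  # [2. 2. 2.]
```

Only the two real tokens contribute, so the result is their plain average.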
2_Dense/config.json ADDED
@@ -0,0 +1 @@
+ {"in_features": 1024, "out_features": 1792, "bias": true, "activation_function": "torch.nn.modules.linear.Identity"}
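This Dense head maps the 1024-dimensional pooled vector to the model's final 1792 dimensions with no nonlinearity (`Identity`). A sketch of that affine projection, using hypothetical random weights in place of the real parameters stored in `2_Dense/model.safetensors`:

```python
import numpy as np

rng = np.random.default_rng(0)

in_features, out_features = 1024, 1792
# Hypothetical random weights; the real values live in 2_Dense/model.safetensors.
W = rng.standard_normal((out_features, in_features)) * 0.02
b = np.zeros(out_features)  # bias: true

pooled = rng.standard_normal(in_features)  # output of the mean-pooling module
projected = W @ pooled + b                 # activation_function is Identity: no nonlinearity

print(projected.shape)  # (1792,)
```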
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3364399373f9291e4f72eca7e221e02b1eef7bcf5b843391627a4c8012e0bc34
+ size 7347360
README.md CHANGED
@@ -1,3 +1,600 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - zh
+ license: mit
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:225000
+ - loss:MultipleNegativesRankingLoss
+ base_model: richinfoai/ritrieve_zh_v1
+ widget:
+ - source_sentence: 下班后和同事直奔常去的那家火锅店,热热闹闹地涮了一晚上。
+   sentences:
+   - 联延掩四远,赫弈成洪炉。
+   - 把酒仰问天,古今谁不死。
+   - 骑出平阳里,筵开卫尉家。
+ - source_sentence: 站在山顶看日出时,突然觉得世俗烦恼都不重要了。
+   sentences:
+   - 郁没二悲魂,萧条犹在否。
+   - 封疆亲日月,邑里出王公。
+   - 心朝玉皇帝,貌似紫阳人。
+ - source_sentence: 隔壁老张家两个儿子都被征走了,现在天天以泪洗面。
+   sentences:
+   - 若教为女嫁东风,除却黄莺难匹配。
+   - 山东今岁点行频,几处冤魂哭虏尘。
+   - 远图尝画地,超拜乃登坛。
+ - source_sentence: 边境小镇常年没人驻守,只有老李一个人在山脚下种地。
+   sentences:
+   - 海徼长无戍,湘山独种畬。
+   - 高名宋玉遗闲丽,作赋兰成绝盛才。
+   - 九衢南面色,苍翠绝纤尘。
+ - source_sentence: 微信列表翻到底,能说真心话的居然只剩快递群。
+   sentences:
+   - 黛消波月空蟾影,歌息梁尘有梵声。
+   - 代情难重论,人事好乖移。
+   - 时应记得长安事,曾向文场属思劳。
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # RITRIEVE ZH 微调:古诗 ↔ 现代语
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [richinfoai/ritrieve_zh_v1](https://huggingface.co/richinfoai/ritrieve_zh_v1) on the json dataset. It maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [richinfoai/ritrieve_zh_v1](https://huggingface.co/richinfoai/ritrieve_zh_v1) <!-- at revision f8d5a707656c55705027678e311f9202c8ced12c -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1792 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Dataset:**
+   - json
+ - **Language:** zh
+ - **License:** mit
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Dense({'in_features': 1024, 'out_features': 1792, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sentence_transformers_model_id")
+ # Run inference
+ sentences = [
+     '微信列表翻到底,能说真心话的居然只剩快递群。',
+     '代情难重论,人事好乖移。',
+     '时应记得长安事,曾向文场属思劳。',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1792]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
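`model.similarity()` uses cosine similarity for this model (`similarity_fn_name: cosine`), so the same matrix can be computed by hand: normalize each embedding to unit length, then take pairwise dot products. A sketch on random stand-in vectors (no model download needed):

```python
import numpy as np

# Random stand-ins for three 1792-dim embeddings, just to show the shapes.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((3, 1792))

# Cosine similarity: normalize rows, then one matrix product yields the
# full pairwise similarity matrix, as model.similarity() does here.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms
similarities = unit @ unit.T

print(similarities.shape)                        # (3, 3)
print(np.allclose(np.diag(similarities), 1.0))   # True: each vector matches itself
```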
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### json
+
+ * Dataset: json
+ * Size: 225,000 training samples
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anchor | positive | negative |
+   |:--------|:-------|:---------|:---------|
+   | type    | string | string | string |
+   | details | <ul><li>min: 14 tokens</li><li>mean: 26.51 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 15.23 tokens</li><li>max: 27 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 15.34 tokens</li><li>max: 34 tokens</li></ul> |
+ * Samples:
+   | anchor | positive | negative |
+   |:-------|:---------|:---------|
+   | <code>整个人蜷在阳光里,连毛衣都晒出一股蓬松的香味。</code> | <code>箕踞拥裘坐,半身在日旸。</code> | <code>洛阳女儿对门居,才可容颜十五馀。</code> |
+   | <code>好像所有的好事都约好了一样,今天一起找上门来。</code> | <code>临终极乐宝华迎,观音势至俱来至。</code> | <code>身没南朝宅已荒,邑人犹赏旧风光。</code> |
+   | <code>大家都觉得她太娇气,只有你一直小心照顾着她。</code> | <code>弱质人皆弃,唯君手自栽。</code> | <code>秦筑长城城已摧,汉武北上单于台。</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
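MultipleNegativesRankingLoss treats every other candidate in the batch as a negative for each anchor: scaled cosine similarities form a score matrix, and cross-entropy pushes each diagonal entry (anchor paired with its own positive) above the rest of its row. A self-contained NumPy sketch with the parameters above (`scale: 20.0`, `cos_sim`), on toy orthogonal vectors rather than real embeddings:

```python
import numpy as np

def mnrl_loss(anchors, candidates, scale=20.0):
    # Normalize rows so the dot product is cosine similarity (cos_sim),
    # then scale the scores (scale: 20.0).
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = scale * (a @ c.T)                      # (batch, batch)
    # Cross-entropy where row i's target is column i: each anchor's own
    # positive, with all other in-batch candidates acting as negatives.
    logsumexp = np.log(np.exp(scores).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(scores)))

# Perfect batch: each anchor identical to its positive, orthogonal to the rest.
anchors = np.eye(4)
print(mnrl_loss(anchors, anchors))                      # ≈ 0

# Mismatched batch: every "positive" actually belongs to a different anchor.
print(mnrl_loss(anchors, np.roll(anchors, 1, axis=0)))  # ≈ 20
```

The loss is (near) zero when each anchor is most similar to its own positive and grows with the margin by which a wrong candidate wins.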
+
+ ### Evaluation Dataset
+
+ #### json
+
+ * Dataset: json
+ * Size: 25,000 evaluation samples
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anchor | positive | negative |
+   |:--------|:-------|:---------|:---------|
+   | type    | string | string | string |
+   | details | <ul><li>min: 12 tokens</li><li>mean: 26.86 tokens</li><li>max: 46 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 15.31 tokens</li><li>max: 29 tokens</li></ul> | <ul><li>min: 12 tokens</li><li>mean: 15.3 tokens</li><li>max: 26 tokens</li></ul> |
+ * Samples:
+   | anchor | positive | negative |
+   |:-------|:---------|:---------|
+   | <code>看着街边那些孤零零的老人,真怕自己以后也变成那样。</code> | <code>垂白乱南翁,委身希北叟。</code> | <code>熏香荀令偏怜少,傅粉何郎不解愁。</code> |
+   | <code>关了灯,屋里黑漆漆的,就听见外面秋虫和落叶在说话。</code> | <code>秋虫与秋叶,一夜隔窗闻。</code> | <code>未能穷意义,岂敢求瑕痕。</code> |
+   | <code>虽然爷爷不在了,但他教我做人的道理永远记在心里。</code> | <code>惟孝虽遥,灵规不朽。</code> | <code>巧类鸳机织,光攒麝月团。</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 128
+ - `per_device_eval_batch_size`: 128
+ - `learning_rate`: 2e-05
+ - `num_train_epochs`: 6
+ - `warmup_ratio`: 0.1
+ - `fp16`: True
+ - `batch_sampler`: no_duplicates
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 128
+ - `per_device_eval_batch_size`: 128
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 6
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Training Logs
+ <details><summary>Click to expand</summary>
+
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:-----:|:-------------:|:---------------:|
+ | 0.0284 | 50 | 4.4241 | - |
+ | 0.0569 | 100 | 3.4415 | - |
+ | 0.0853 | 150 | 2.6725 | - |
+ | 0.1138 | 200 | 2.4137 | 2.2686 |
+ | 0.1422 | 250 | 2.2701 | - |
+ | 0.1706 | 300 | 2.1523 | - |
+ | 0.1991 | 350 | 2.0805 | - |
+ | 0.2275 | 400 | 2.0513 | 1.9506 |
+ | 0.2560 | 450 | 2.0048 | - |
+ | 0.2844 | 500 | 1.9552 | - |
+ | 0.3129 | 550 | 1.8778 | - |
+ | 0.3413 | 600 | 1.8549 | 1.7630 |
+ | 0.3697 | 650 | 1.822 | - |
+ | 0.3982 | 700 | 1.8128 | - |
+ | 0.4266 | 750 | 1.7742 | - |
+ | 0.4551 | 800 | 1.7076 | 1.6331 |
+ | 0.4835 | 850 | 1.6919 | - |
+ | 0.5119 | 900 | 1.64 | - |
+ | 0.5404 | 950 | 1.6291 | - |
+ | 0.5688 | 1000 | 1.5881 | 1.5368 |
+ | 0.5973 | 1050 | 1.6018 | - |
+ | 0.6257 | 1100 | 1.5664 | - |
+ | 0.6542 | 1150 | 1.5545 | - |
+ | 0.6826 | 1200 | 1.5292 | 1.4532 |
+ | 0.7110 | 1250 | 1.5166 | - |
+ | 0.7395 | 1300 | 1.517 | - |
+ | 0.7679 | 1350 | 1.4639 | - |
+ | 0.7964 | 1400 | 1.4729 | 1.3687 |
+ | 0.8248 | 1450 | 1.4501 | - |
+ | 0.8532 | 1500 | 1.3932 | - |
+ | 0.8817 | 1550 | 1.4063 | - |
+ | 0.9101 | 1600 | 1.3825 | 1.3003 |
+ | 0.9386 | 1650 | 1.3647 | - |
+ | 0.9670 | 1700 | 1.3431 | - |
+ | 0.9954 | 1750 | 1.3417 | - |
+ | 1.0239 | 1800 | 1.0839 | 1.2431 |
+ | 1.0523 | 1850 | 1.0801 | - |
+ | 1.0808 | 1900 | 1.0577 | - |
+ | 1.1092 | 1950 | 1.0159 | - |
+ | 1.1377 | 2000 | 1.0239 | 1.2132 |
+ | 1.1661 | 2050 | 1.0335 | - |
+ | 1.1945 | 2100 | 1.0117 | - |
+ | 1.2230 | 2150 | 1.0343 | - |
+ | 1.2514 | 2200 | 1.0193 | 1.1808 |
+ | 1.2799 | 2250 | 1.0235 | - |
+ | 1.3083 | 2300 | 0.9949 | - |
+ | 1.3367 | 2350 | 1.0058 | - |
+ | 1.3652 | 2400 | 1.0039 | 1.1428 |
+ | 1.3936 | 2450 | 1.0164 | - |
+ | 1.4221 | 2500 | 0.9934 | - |
+ | 1.4505 | 2550 | 0.9777 | - |
+ | 1.4790 | 2600 | 0.9753 | 1.1101 |
+ | 1.5074 | 2650 | 0.9621 | - |
+ | 1.5358 | 2700 | 0.9756 | - |
+ | 1.5643 | 2750 | 0.9725 | - |
+ | 1.5927 | 2800 | 0.9649 | 1.0813 |
+ | 1.6212 | 2850 | 0.9652 | - |
+ | 1.6496 | 2900 | 0.9861 | - |
+ | 1.6780 | 2950 | 0.916 | - |
+ | 1.7065 | 3000 | 0.9417 | 1.0523 |
+ | 1.7349 | 3050 | 0.9599 | - |
+ | 1.7634 | 3100 | 0.9275 | - |
+ | 1.7918 | 3150 | 0.9247 | - |
+ | 1.8203 | 3200 | 0.9417 | 1.0306 |
+ | 1.8487 | 3250 | 0.9275 | - |
+ | 1.8771 | 3300 | 0.9431 | - |
+ | 1.9056 | 3350 | 0.9147 | - |
+ | 1.9340 | 3400 | 0.8957 | 1.0051 |
+ | 1.9625 | 3450 | 0.9169 | - |
+ | 1.9909 | 3500 | 0.9079 | - |
+ | 2.0193 | 3550 | 0.7057 | - |
+ | 2.0478 | 3600 | 0.6037 | 0.9944 |
+ | 2.0762 | 3650 | 0.5888 | - |
+ | 2.1047 | 3700 | 0.6134 | - |
+ | 2.1331 | 3750 | 0.6209 | - |
+ | 2.1615 | 3800 | 0.6163 | 0.9836 |
+ | 2.1900 | 3850 | 0.6271 | - |
+ | 2.2184 | 3900 | 0.629 | - |
+ | 2.2469 | 3950 | 0.6041 | - |
+ | 2.2753 | 4000 | 0.622 | 0.9792 |
+ | 2.3038 | 4050 | 0.6175 | - |
+ | 2.3322 | 4100 | 0.627 | - |
+ | 2.3606 | 4150 | 0.6339 | - |
+ | 2.3891 | 4200 | 0.6325 | 0.9643 |
+ | 2.4175 | 4250 | 0.6044 | - |
+ | 2.4460 | 4300 | 0.6124 | - |
+ | 2.4744 | 4350 | 0.6326 | - |
+ | 2.5028 | 4400 | 0.6349 | 0.9462 |
+ | 2.5313 | 4450 | 0.6286 | - |
+ | 2.5597 | 4500 | 0.6325 | - |
+ | 2.5882 | 4550 | 0.6399 | - |
+ | 2.6166 | 4600 | 0.6184 | 0.9317 |
+ | 2.6451 | 4650 | 0.6292 | - |
+ | 2.6735 | 4700 | 0.6017 | - |
+ | 2.7019 | 4750 | 0.6305 | - |
+ | 2.7304 | 4800 | 0.6152 | 0.9213 |
+ | 2.7588 | 4850 | 0.5972 | - |
+ | 2.7873 | 4900 | 0.6048 | - |
+ | 2.8157 | 4950 | 0.6096 | - |
+ | 2.8441 | 5000 | 0.6156 | 0.9073 |
+ | 2.8726 | 5050 | 0.5942 | - |
+ | 2.9010 | 5100 | 0.592 | - |
+ | 2.9295 | 5150 | 0.6088 | - |
+ | 2.9579 | 5200 | 0.5941 | 0.8950 |
+ | 2.9863 | 5250 | 0.6161 | - |
+ | 3.0148 | 5300 | 0.5021 | - |
+ | 3.0432 | 5350 | 0.4116 | - |
+ | 3.0717 | 5400 | 0.3936 | 0.9009 |
+ | 3.1001 | 5450 | 0.4193 | - |
+ | 3.1286 | 5500 | 0.422 | - |
+ | 3.1570 | 5550 | 0.432 | - |
+ | 3.1854 | 5600 | 0.4281 | 0.8985 |
+ | 3.2139 | 5650 | 0.4091 | - |
+ | 3.2423 | 5700 | 0.4305 | - |
+ | 3.2708 | 5750 | 0.4203 | - |
+ | 3.2992 | 5800 | 0.4193 | 0.8869 |
+ | 3.3276 | 5850 | 0.4238 | - |
+ | 3.3561 | 5900 | 0.4274 | - |
+ | 3.3845 | 5950 | 0.4124 | - |
+ | 3.4130 | 6000 | 0.4241 | 0.8842 |
+ | 3.4414 | 6050 | 0.427 | - |
+ | 3.4699 | 6100 | 0.4275 | - |
+ | 3.4983 | 6150 | 0.4152 | - |
+ | 3.5267 | 6200 | 0.4247 | 0.8733 |
+ | 3.5552 | 6250 | 0.4111 | - |
+ | 3.5836 | 6300 | 0.4396 | - |
+ | 3.6121 | 6350 | 0.4122 | - |
+ | 3.6405 | 6400 | 0.4252 | 0.8657 |
+ | 3.6689 | 6450 | 0.4167 | - |
+ | 3.6974 | 6500 | 0.4282 | - |
+ | 3.7258 | 6550 | 0.411 | - |
+ | 3.7543 | 6600 | 0.4273 | 0.8540 |
+ | 3.7827 | 6650 | 0.4327 | - |
+ | 3.8111 | 6700 | 0.431 | - |
+ | 3.8396 | 6750 | 0.4347 | - |
+ | 3.8680 | 6800 | 0.4264 | 0.8523 |
+ | 3.8965 | 6850 | 0.4213 | - |
+ | 3.9249 | 6900 | 0.4285 | - |
+ | 3.9534 | 6950 | 0.4138 | - |
+ | 3.9818 | 7000 | 0.4051 | 0.8407 |
+ | 4.0102 | 7050 | 0.3779 | - |
+ | 4.0387 | 7100 | 0.2957 | - |
+ | 4.0671 | 7150 | 0.2939 | - |
+ | 4.0956 | 7200 | 0.3065 | 0.8590 |
+ | 4.1240 | 7250 | 0.3081 | - |
+ | 4.1524 | 7300 | 0.3043 | - |
+ | 4.1809 | 7350 | 0.3176 | - |
+ | 4.2093 | 7400 | 0.3067 | 0.8487 |
+ | 4.2378 | 7450 | 0.299 | - |
+ | 4.2662 | 7500 | 0.3106 | - |
+ | 4.2947 | 7550 | 0.3062 | - |
+ | 4.3231 | 7600 | 0.3153 | 0.8498 |
+ | 4.3515 | 7650 | 0.3206 | - |
+ | 4.3800 | 7700 | 0.3202 | - |
+ | 4.4084 | 7750 | 0.3167 | - |
+ | 4.4369 | 7800 | 0.3044 | 0.8426 |
+ | 4.4653 | 7850 | 0.3015 | - |
+ | 4.4937 | 7900 | 0.3157 | - |
+ | 4.5222 | 7950 | 0.3109 | - |
+ | 4.5506 | 8000 | 0.3164 | 0.8385 |
+ | 4.5791 | 8050 | 0.2996 | - |
+ | 4.6075 | 8100 | 0.3247 | - |
+ | 4.6359 | 8150 | 0.3093 | - |
+ | 4.6644 | 8200 | 0.3017 | 0.8294 |
+ | 4.6928 | 8250 | 0.3075 | - |
+ | 4.7213 | 8300 | 0.3006 | - |
+ | 4.7497 | 8350 | 0.3134 | - |
+ | 4.7782 | 8400 | 0.3111 | 0.8249 |
+ | 4.8066 | 8450 | 0.3165 | - |
+ | 4.8350 | 8500 | 0.3071 | - |
+ | 4.8635 | 8550 | 0.3017 | - |
+ | 4.8919 | 8600 | 0.3092 | 0.8225 |
+ | 4.9204 | 8650 | 0.3 | - |
+ | 4.9488 | 8700 | 0.2999 | - |
+ | 4.9772 | 8750 | 0.3116 | - |
+ | 5.0057 | 8800 | 0.3046 | 0.8173 |
+ | 5.0341 | 8850 | 0.2501 | - |
+ | 5.0626 | 8900 | 0.2443 | - |
+ | 5.0910 | 8950 | 0.2338 | - |
+ | 5.1195 | 9000 | 0.2382 | 0.8248 |
+ | 5.1479 | 9050 | 0.2524 | - |
+ | 5.1763 | 9100 | 0.2427 | - |
+ | 5.2048 | 9150 | 0.2512 | - |
+ | 5.2332 | 9200 | 0.2377 | 0.8218 |
+ | 5.2617 | 9250 | 0.2458 | - |
+ | 5.2901 | 9300 | 0.2515 | - |
+ | 5.3185 | 9350 | 0.2453 | - |
+ | 5.3470 | 9400 | 0.244 | 0.8226 |
+ | 5.3754 | 9450 | 0.2389 | - |
+ | 5.4039 | 9500 | 0.253 | - |
+ | 5.4323 | 9550 | 0.2509 | - |
+ | 5.4608 | 9600 | 0.2492 | 0.8198 |
+ | 5.4892 | 9650 | 0.2379 | - |
+ | 5.5176 | 9700 | 0.247 | - |
+ | 5.5461 | 9750 | 0.2419 | - |
+ | 5.5745 | 9800 | 0.244 | 0.8150 |
+ | 5.6030 | 9850 | 0.2498 | - |
+ | 5.6314 | 9900 | 0.2381 | - |
+ | 5.6598 | 9950 | 0.2425 | - |
+ | 5.6883 | 10000 | 0.2451 | 0.8148 |
+ | 5.7167 | 10050 | 0.2468 | - |
+ | 5.7452 | 10100 | 0.2404 | - |
+ | 5.7736 | 10150 | 0.2397 | - |
+ | 5.8020 | 10200 | 0.2417 | 0.8124 |
+ | 5.8305 | 10250 | 0.2446 | - |
+ | 5.8589 | 10300 | 0.2443 | - |
+ | 5.8874 | 10350 | 0.2465 | - |
+ | 5.9158 | 10400 | 0.2472 | 0.8121 |
+
+ </details>
+
+ ### Framework Versions
+ - Python: 3.10.16
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0+cu126
+ - Accelerate: 1.7.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 21128
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.3",
+     "pytorch": "2.7.0+cu126"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:03d42e3c4b09adac3cec1f20cfe457a37a3eb783c8bd647e8da9deb51177078c
+ size 1302134568
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Dense",
+     "type": "sentence_transformers.models.Dense"
+   }
+ ]
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cba3d24e972f7db24f5f97ae2f3dc3b3ca7310c82cca6620107eb6abc7a8c886
+ size 2610804728
rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:70a100380b90d5ba0552316471b712696f824e530a39e2db8a14e797a1411411
+ size 14645
scaler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b800ccb5ba5d34f5db628b24f67ebfc83905021996eaafd14911414255c4533
+ size 1383
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5fdf70e901564afed39d1bb944ba427956b0106eda5fafc834314f0cfd1d93ed
+ size 1465
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
trainer_state.json ADDED
@@ -0,0 +1,1906 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": 0.8121369481086731,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 5.915813424345847,
6
+ "eval_steps": 200,
7
+ "global_step": 10400,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.02844141069397042,
14
+ "grad_norm": 8.889737129211426,
15
+ "learning_rate": 9.099526066350711e-07,
16
+ "loss": 4.4241,
17
+ "step": 50
18
+ },
19
+ {
20
+ "epoch": 0.05688282138794084,
21
+ "grad_norm": 7.543558120727539,
22
+ "learning_rate": 1.8578199052132703e-06,
23
+ "loss": 3.4415,
24
+ "step": 100
25
+ },
26
+ {
27
+ "epoch": 0.08532423208191127,
28
+ "grad_norm": 7.774235725402832,
29
+ "learning_rate": 2.8056872037914696e-06,
30
+ "loss": 2.6725,
31
+ "step": 150
32
+ },
33
+ {
34
+ "epoch": 0.11376564277588168,
35
+ "grad_norm": 7.825632572174072,
36
+ "learning_rate": 3.7535545023696683e-06,
37
+ "loss": 2.4137,
38
+ "step": 200
39
+ },
40
+ {
41
+ "epoch": 0.11376564277588168,
42
+ "eval_loss": 2.2685751914978027,
43
+ "eval_runtime": 29.7449,
44
+ "eval_samples_per_second": 840.481,
45
+ "eval_steps_per_second": 6.589,
46
+ "step": 200
47
+ },
48
+ {
49
+ "epoch": 0.1422070534698521,
50
+ "grad_norm": 8.4616060256958,
51
+ "learning_rate": 4.701421800947868e-06,
52
+ "loss": 2.2701,
53
+ "step": 250
54
+ },
55
+ {
56
+ "epoch": 0.17064846416382254,
57
+ "grad_norm": 7.439651966094971,
58
+ "learning_rate": 5.6492890995260666e-06,
59
+ "loss": 2.1523,
60
+ "step": 300
61
+ },
62
+ {
63
+ "epoch": 0.19908987485779295,
64
+ "grad_norm": 8.319734573364258,
65
+ "learning_rate": 6.597156398104266e-06,
66
+ "loss": 2.0805,
67
+ "step": 350
68
+ },
69
+ {
70
+ "epoch": 0.22753128555176336,
71
+ "grad_norm": 7.824019432067871,
72
+ "learning_rate": 7.545023696682466e-06,
73
+ "loss": 2.0513,
74
+ "step": 400
75
+ },
76
+ {
77
+ "epoch": 0.22753128555176336,
78
+ "eval_loss": 1.9506336450576782,
79
+ "eval_runtime": 28.6984,
80
+ "eval_samples_per_second": 871.127,
81
+ "eval_steps_per_second": 6.83,
82
+ "step": 400
83
+ },
84
+ {
85
+ "epoch": 0.25597269624573377,
86
+ "grad_norm": 8.402134895324707,
87
+ "learning_rate": 8.492890995260664e-06,
88
+ "loss": 2.0048,
89
+ "step": 450
90
+ },
91
+ {
92
+ "epoch": 0.2844141069397042,
93
+ "grad_norm": 7.345431327819824,
94
+ "learning_rate": 9.440758293838863e-06,
95
+ "loss": 1.9552,
96
+ "step": 500
97
+ },
98
+ {
99
+ "epoch": 0.31285551763367464,
100
+ "grad_norm": 8.147149085998535,
101
+ "learning_rate": 1.0388625592417063e-05,
102
+ "loss": 1.8778,
103
+ "step": 550
104
+ },
105
+ {
106
+ "epoch": 0.3412969283276451,
107
+ "grad_norm": 7.802554130554199,
108
+ "learning_rate": 1.133649289099526e-05,
109
+ "loss": 1.8549,
110
+ "step": 600
111
+ },
112
+ {
113
+ "epoch": 0.3412969283276451,
114
+ "eval_loss": 1.7629565000534058,
115
+ "eval_runtime": 33.6232,
116
+ "eval_samples_per_second": 743.534,
117
+ "eval_steps_per_second": 5.829,
118
+ "step": 600
119
+ },
120
+ {
121
+ "epoch": 0.36973833902161546,
122
+ "grad_norm": 7.983552932739258,
123
+ "learning_rate": 1.228436018957346e-05,
124
+ "loss": 1.822,
125
+ "step": 650
126
+ },
127
+ {
128
+ "epoch": 0.3981797497155859,
129
+ "grad_norm": 8.035250663757324,
130
+ "learning_rate": 1.323222748815166e-05,
131
+ "loss": 1.8128,
132
+ "step": 700
133
+ },
134
+ {
135
+ "epoch": 0.42662116040955633,
136
+ "grad_norm": 8.409351348876953,
137
+ "learning_rate": 1.4180094786729858e-05,
138
+ "loss": 1.7742,
139
+ "step": 750
140
+ },
141
+ {
142
+ "epoch": 0.4550625711035267,
143
+ "grad_norm": 7.7319183349609375,
144
+ "learning_rate": 1.5127962085308059e-05,
145
+ "loss": 1.7076,
146
+ "step": 800
147
+ },
148
+ {
149
+ "epoch": 0.4550625711035267,
150
+ "eval_loss": 1.6330854892730713,
151
+ "eval_runtime": 33.0226,
152
+ "eval_samples_per_second": 757.058,
153
+ "eval_steps_per_second": 5.935,
154
+ "step": 800
155
+ },
156
+ {
157
+ "epoch": 0.48350398179749715,
158
+ "grad_norm": 7.466287136077881,
159
+ "learning_rate": 1.6075829383886257e-05,
160
+ "loss": 1.6919,
161
+ "step": 850
162
+ },
163
+ {
164
+ "epoch": 0.5119453924914675,
165
+ "grad_norm": 7.655446529388428,
166
+ "learning_rate": 1.7023696682464458e-05,
167
+ "loss": 1.64,
168
+ "step": 900
169
+ },
170
+ {
171
+ "epoch": 0.540386803185438,
172
+ "grad_norm": 8.173416137695312,
173
+ "learning_rate": 1.7971563981042655e-05,
174
+ "loss": 1.6291,
175
+ "step": 950
176
+ },
177
+ {
178
+ "epoch": 0.5688282138794084,
179
+ "grad_norm": 7.376980781555176,
180
+ "learning_rate": 1.8919431279620855e-05,
181
+ "loss": 1.5881,
182
+ "step": 1000
183
+ },
184
+ {
185
+ "epoch": 0.5688282138794084,
186
+ "eval_loss": 1.5367897748947144,
187
+ "eval_runtime": 32.9799,
188
+ "eval_samples_per_second": 758.038,
189
+ "eval_steps_per_second": 5.943,
190
+ "step": 1000
191
+ },
192
+ {
193
+ "epoch": 0.5972696245733788,
194
+ "grad_norm": 7.863293170928955,
195
+ "learning_rate": 1.9867298578199055e-05,
196
+ "loss": 1.6018,
197
+ "step": 1050
198
+ },
199
+ {
200
+ "epoch": 0.6257110352673493,
201
+ "grad_norm": 7.6200385093688965,
202
+ "learning_rate": 1.9909406931423158e-05,
203
+ "loss": 1.5664,
204
+ "step": 1100
205
+ },
206
+ {
207
+ "epoch": 0.6541524459613197,
208
+ "grad_norm": 8.286286354064941,
209
+ "learning_rate": 1.9804066154008218e-05,
210
+ "loss": 1.5545,
211
+ "step": 1150
212
+ },
213
+ {
214
+ "epoch": 0.6825938566552902,
215
+ "grad_norm": 7.845026969909668,
216
+ "learning_rate": 1.969872537659328e-05,
217
+ "loss": 1.5292,
218
+ "step": 1200
219
+ },
220
+ {
221
+ "epoch": 0.6825938566552902,
222
+ "eval_loss": 1.4531670808792114,
223
+ "eval_runtime": 29.4807,
224
+ "eval_samples_per_second": 848.011,
225
+ "eval_steps_per_second": 6.648,
226
+ "step": 1200
227
+ },
228
+ {
229
+ "epoch": 0.7110352673492605,
230
+ "grad_norm": 7.120193004608154,
231
+ "learning_rate": 1.9593384599178345e-05,
232
+ "loss": 1.5166,
233
+ "step": 1250
234
+ },
235
+ {
236
+ "epoch": 0.7394766780432309,
237
+ "grad_norm": 7.721842288970947,
238
+ "learning_rate": 1.9488043821763408e-05,
239
+ "loss": 1.517,
240
+ "step": 1300
241
+ },
242
+ {
243
+ "epoch": 0.7679180887372014,
244
+ "grad_norm": 7.104468822479248,
245
+ "learning_rate": 1.938270304434847e-05,
246
+ "loss": 1.4639,
247
+ "step": 1350
248
+ },
249
+ {
250
+ "epoch": 0.7963594994311718,
251
+ "grad_norm": 7.570240020751953,
252
+ "learning_rate": 1.927736226693353e-05,
253
+ "loss": 1.4729,
254
+ "step": 1400
255
+ },
256
+ {
257
+ "epoch": 0.7963594994311718,
258
+ "eval_loss": 1.368685245513916,
259
+ "eval_runtime": 28.6992,
260
+ "eval_samples_per_second": 871.103,
261
+ "eval_steps_per_second": 6.829,
262
+ "step": 1400
263
+ },
264
+ {
265
+ "epoch": 0.8248009101251422,
266
+ "grad_norm": 7.745856761932373,
267
+ "learning_rate": 1.9172021489518595e-05,
268
+ "loss": 1.4501,
269
+ "step": 1450
270
+ },
271
+ {
272
+ "epoch": 0.8532423208191127,
273
+ "grad_norm": 7.175948619842529,
274
+ "learning_rate": 1.906668071210366e-05,
275
+ "loss": 1.3932,
276
+ "step": 1500
277
+ },
278
+ {
279
+ "epoch": 0.8816837315130831,
280
+ "grad_norm": 8.291092872619629,
281
+ "learning_rate": 1.8961339934688722e-05,
282
+ "loss": 1.4063,
283
+ "step": 1550
284
+ },
285
+ {
286
+ "epoch": 0.9101251422070534,
287
+ "grad_norm": 7.994405269622803,
288
+ "learning_rate": 1.8855999157273782e-05,
289
+ "loss": 1.3825,
290
+ "step": 1600
291
+ },
292
+ {
293
+ "epoch": 0.9101251422070534,
294
+ "eval_loss": 1.300325632095337,
295
+ "eval_runtime": 28.6638,
296
+ "eval_samples_per_second": 872.179,
297
+ "eval_steps_per_second": 6.838,
298
+ "step": 1600
299
+ },
300
+ {
301
+ "epoch": 0.9385665529010239,
302
+ "grad_norm": 8.009012222290039,
303
+ "learning_rate": 1.8750658379858845e-05,
304
+ "loss": 1.3647,
305
+ "step": 1650
306
+ },
307
+ {
308
+ "epoch": 0.9670079635949943,
309
+ "grad_norm": 8.436450004577637,
310
+ "learning_rate": 1.864531760244391e-05,
311
+ "loss": 1.3431,
312
+ "step": 1700
313
+ },
314
+ {
315
+ "epoch": 0.9954493742889647,
316
+ "grad_norm": 7.547204971313477,
317
+ "learning_rate": 1.8539976825028972e-05,
318
+ "loss": 1.3417,
319
+ "step": 1750
320
+ },
321
+ {
322
+ "epoch": 1.023890784982935,
323
+ "grad_norm": 6.637471675872803,
324
+ "learning_rate": 1.8434636047614032e-05,
325
+ "loss": 1.0839,
326
+ "step": 1800
327
+ },
328
+ {
329
+ "epoch": 1.023890784982935,
330
+ "eval_loss": 1.2430765628814697,
331
+ "eval_runtime": 28.6828,
332
+ "eval_samples_per_second": 871.603,
333
+ "eval_steps_per_second": 6.833,
334
+ "step": 1800
335
+ },
336
+ {
337
+ "epoch": 1.0523321956769056,
338
+ "grad_norm": 7.198896408081055,
339
+ "learning_rate": 1.8329295270199096e-05,
340
+ "loss": 1.0801,
341
+ "step": 1850
342
+ },
343
+ {
344
+ "epoch": 1.080773606370876,
345
+ "grad_norm": 7.391284942626953,
346
+ "learning_rate": 1.8223954492784156e-05,
347
+ "loss": 1.0577,
348
+ "step": 1900
349
+ },
350
+ {
351
+ "epoch": 1.1092150170648465,
352
+ "grad_norm": 6.571183681488037,
353
+ "learning_rate": 1.811861371536922e-05,
354
+ "loss": 1.0159,
355
+ "step": 1950
356
+ },
357
+ {
358
+ "epoch": 1.1376564277588168,
359
+ "grad_norm": 7.20968770980835,
360
+ "learning_rate": 1.8013272937954283e-05,
361
+ "loss": 1.0239,
362
+ "step": 2000
363
+ },
364
+ {
365
+ "epoch": 1.1376564277588168,
366
+ "eval_loss": 1.213191270828247,
367
+ "eval_runtime": 28.5325,
368
+ "eval_samples_per_second": 876.195,
369
+ "eval_steps_per_second": 6.869,
370
+ "step": 2000
371
+ },
372
+ {
373
+ "epoch": 1.1660978384527874,
374
+ "grad_norm": 6.97741174697876,
375
+ "learning_rate": 1.7907932160539346e-05,
376
+ "loss": 1.0335,
377
+ "step": 2050
378
+ },
379
+ {
380
+ "epoch": 1.1945392491467577,
381
+ "grad_norm": 7.157691478729248,
382
+ "learning_rate": 1.7802591383124406e-05,
383
+ "loss": 1.0117,
384
+ "step": 2100
385
+ },
386
+ {
387
+ "epoch": 1.222980659840728,
388
+ "grad_norm": 7.168184280395508,
389
+ "learning_rate": 1.769725060570947e-05,
390
+ "loss": 1.0343,
391
+ "step": 2150
392
+ },
393
+ {
394
+ "epoch": 1.2514220705346986,
395
+ "grad_norm": 7.099086284637451,
396
+ "learning_rate": 1.7591909828294533e-05,
397
+ "loss": 1.0193,
398
+ "step": 2200
399
+ },
400
+ {
401
+ "epoch": 1.2514220705346986,
402
+ "eval_loss": 1.1807738542556763,
403
+ "eval_runtime": 28.5908,
404
+ "eval_samples_per_second": 874.407,
405
+ "eval_steps_per_second": 6.855,
406
+ "step": 2200
407
+ },
408
+ {
409
+ "epoch": 1.2798634812286689,
410
+ "grad_norm": 7.232935905456543,
411
+ "learning_rate": 1.7486569050879597e-05,
412
+ "loss": 1.0235,
413
+ "step": 2250
414
+ },
415
+ {
416
+ "epoch": 1.3083048919226394,
417
+ "grad_norm": 6.775105953216553,
418
+ "learning_rate": 1.738122827346466e-05,
419
+ "loss": 0.9949,
420
+ "step": 2300
421
+ },
422
+ {
423
+ "epoch": 1.3367463026166098,
424
+ "grad_norm": 6.916153430938721,
425
+ "learning_rate": 1.727588749604972e-05,
426
+ "loss": 1.0058,
427
+ "step": 2350
428
+ },
429
+ {
430
+ "epoch": 1.36518771331058,
431
+ "grad_norm": 6.561580181121826,
432
+ "learning_rate": 1.7170546718634784e-05,
433
+ "loss": 1.0039,
434
+ "step": 2400
435
+ },
436
+ {
437
+ "epoch": 1.36518771331058,
438
+ "eval_loss": 1.1427565813064575,
439
+ "eval_runtime": 28.6907,
440
+ "eval_samples_per_second": 871.363,
441
+ "eval_steps_per_second": 6.831,
442
+ "step": 2400
443
+ },
444
+ {
445
+ "epoch": 1.3936291240045506,
446
+ "grad_norm": 6.508544921875,
447
+ "learning_rate": 1.7065205941219847e-05,
448
+ "loss": 1.0164,
449
+ "step": 2450
450
+ },
451
+ {
452
+ "epoch": 1.4220705346985212,
453
+ "grad_norm": 7.889155387878418,
454
+ "learning_rate": 1.695986516380491e-05,
455
+ "loss": 0.9934,
456
+ "step": 2500
457
+ },
458
+ {
459
+ "epoch": 1.4505119453924915,
460
+ "grad_norm": 7.1703782081604,
461
+ "learning_rate": 1.685452438638997e-05,
462
+ "loss": 0.9777,
463
+ "step": 2550
464
+ },
465
+ {
466
+ "epoch": 1.4789533560864618,
467
+ "grad_norm": 7.198650360107422,
468
+ "learning_rate": 1.6749183608975034e-05,
469
+ "loss": 0.9753,
470
+ "step": 2600
471
+ },
472
+ {
473
+ "epoch": 1.4789533560864618,
474
+ "eval_loss": 1.1101032495498657,
475
+ "eval_runtime": 28.9361,
476
+ "eval_samples_per_second": 863.971,
477
+ "eval_steps_per_second": 6.774,
478
+ "step": 2600
479
+ },
480
+ {
481
+ "epoch": 1.5073947667804322,
482
+ "grad_norm": 7.485228061676025,
483
+ "learning_rate": 1.6643842831560098e-05,
484
+ "loss": 0.9621,
485
+ "step": 2650
486
+ },
487
+ {
488
+ "epoch": 1.5358361774744027,
489
+ "grad_norm": 6.426005840301514,
490
+ "learning_rate": 1.653850205414516e-05,
491
+ "loss": 0.9756,
492
+ "step": 2700
493
+ },
494
+ {
495
+ "epoch": 1.5642775881683733,
496
+ "grad_norm": 6.803189277648926,
497
+ "learning_rate": 1.643316127673022e-05,
498
+ "loss": 0.9725,
499
+ "step": 2750
500
+ },
501
+ {
502
+ "epoch": 1.5927189988623436,
503
+ "grad_norm": 7.307713508605957,
504
+ "learning_rate": 1.6327820499315285e-05,
505
+ "loss": 0.9649,
506
+ "step": 2800
507
+ },
508
+ {
509
+ "epoch": 1.5927189988623436,
510
+ "eval_loss": 1.0812790393829346,
511
+ "eval_runtime": 28.8811,
512
+ "eval_samples_per_second": 865.619,
513
+ "eval_steps_per_second": 6.786,
514
+ "step": 2800
515
+ },
516
+ {
517
+ "epoch": 1.621160409556314,
518
+ "grad_norm": 6.56484317779541,
519
+ "learning_rate": 1.6222479721900348e-05,
520
+ "loss": 0.9652,
521
+ "step": 2850
522
+ },
523
+ {
524
+ "epoch": 1.6496018202502845,
525
+ "grad_norm": 6.714264392852783,
526
+ "learning_rate": 1.6117138944485412e-05,
527
+ "loss": 0.9861,
528
+ "step": 2900
529
+ },
530
+ {
531
+ "epoch": 1.6780432309442548,
532
+ "grad_norm": 6.9539642333984375,
533
+ "learning_rate": 1.6011798167070475e-05,
534
+ "loss": 0.916,
535
+ "step": 2950
536
+ },
537
+ {
538
+ "epoch": 1.7064846416382253,
539
+ "grad_norm": 6.552751541137695,
540
+ "learning_rate": 1.5906457389655535e-05,
541
+ "loss": 0.9417,
542
+ "step": 3000
543
+ },
544
+ {
545
+ "epoch": 1.7064846416382253,
546
+ "eval_loss": 1.0522855520248413,
547
+ "eval_runtime": 28.864,
548
+ "eval_samples_per_second": 866.132,
549
+ "eval_steps_per_second": 6.79,
550
+ "step": 3000
551
+ },
552
+ {
553
+ "epoch": 1.7349260523321957,
554
+ "grad_norm": 6.961670875549316,
555
+ "learning_rate": 1.58011166122406e-05,
556
+ "loss": 0.9599,
557
+ "step": 3050
558
+ },
559
+ {
560
+ "epoch": 1.763367463026166,
561
+ "grad_norm": 7.874273300170898,
562
+ "learning_rate": 1.5695775834825662e-05,
563
+ "loss": 0.9275,
564
+ "step": 3100
565
+ },
566
+ {
567
+ "epoch": 1.7918088737201365,
568
+ "grad_norm": 5.82428503036499,
569
+ "learning_rate": 1.5590435057410726e-05,
570
+ "loss": 0.9247,
571
+ "step": 3150
572
+ },
573
+ {
574
+ "epoch": 1.820250284414107,
575
+ "grad_norm": 6.425380706787109,
576
+ "learning_rate": 1.5485094279995786e-05,
577
+ "loss": 0.9417,
578
+ "step": 3200
579
+ },
580
+ {
581
+ "epoch": 1.820250284414107,
582
+ "eval_loss": 1.0305691957473755,
583
+ "eval_runtime": 28.6406,
584
+ "eval_samples_per_second": 872.888,
585
+ "eval_steps_per_second": 6.843,
586
+ "step": 3200
587
+ },
588
+ {
589
+ "epoch": 1.8486916951080774,
590
+ "grad_norm": 6.136819362640381,
591
+ "learning_rate": 1.537975350258085e-05,
592
+ "loss": 0.9275,
593
+ "step": 3250
594
+ },
595
+ {
596
+ "epoch": 1.8771331058020477,
597
+ "grad_norm": 6.463824272155762,
598
+ "learning_rate": 1.5274412725165913e-05,
599
+ "loss": 0.9431,
600
+ "step": 3300
601
+ },
602
+ {
603
+ "epoch": 1.905574516496018,
604
+ "grad_norm": 6.83174467086792,
605
+ "learning_rate": 1.5169071947750974e-05,
606
+ "loss": 0.9147,
607
+ "step": 3350
608
+ },
609
+ {
610
+ "epoch": 1.9340159271899886,
611
+ "grad_norm": 7.504420280456543,
612
+ "learning_rate": 1.5063731170336038e-05,
613
+ "loss": 0.8957,
614
+ "step": 3400
615
+ },
616
+ {
617
+ "epoch": 1.9340159271899886,
618
+ "eval_loss": 1.0050827264785767,
619
+ "eval_runtime": 28.9461,
620
+ "eval_samples_per_second": 863.675,
621
+ "eval_steps_per_second": 6.771,
622
+ "step": 3400
623
+ },
624
+ {
625
+ "epoch": 1.9624573378839592,
626
+ "grad_norm": 7.271299839019775,
627
+ "learning_rate": 1.49583903929211e-05,
628
+ "loss": 0.9169,
629
+ "step": 3450
630
+ },
631
+ {
632
+ "epoch": 1.9908987485779295,
633
+ "grad_norm": 6.796669960021973,
634
+ "learning_rate": 1.4853049615506163e-05,
635
+ "loss": 0.9079,
636
+ "step": 3500
637
+ },
638
+ {
639
+ "epoch": 2.0193401592719,
640
+ "grad_norm": 5.5628180503845215,
641
+ "learning_rate": 1.4747708838091227e-05,
642
+ "loss": 0.7057,
643
+ "step": 3550
644
+ },
645
+ {
646
+ "epoch": 2.04778156996587,
647
+ "grad_norm": 5.777904987335205,
648
+ "learning_rate": 1.4642368060676288e-05,
649
+ "loss": 0.6037,
650
+ "step": 3600
651
+ },
652
+ {
653
+ "epoch": 2.04778156996587,
654
+ "eval_loss": 0.9944195747375488,
655
+ "eval_runtime": 28.8677,
656
+ "eval_samples_per_second": 866.019,
657
+ "eval_steps_per_second": 6.79,
658
+ "step": 3600
659
+ },
660
+ {
661
+ "epoch": 2.076222980659841,
662
+ "grad_norm": 5.112311363220215,
663
+ "learning_rate": 1.4537027283261352e-05,
664
+ "loss": 0.5888,
665
+ "step": 3650
666
+ },
667
+ {
668
+ "epoch": 2.1046643913538112,
669
+ "grad_norm": 6.392485618591309,
670
+ "learning_rate": 1.4431686505846414e-05,
671
+ "loss": 0.6134,
672
+ "step": 3700
673
+ },
674
+ {
675
+ "epoch": 2.1331058020477816,
676
+ "grad_norm": 6.09423303604126,
677
+ "learning_rate": 1.4326345728431477e-05,
678
+ "loss": 0.6209,
679
+ "step": 3750
680
+ },
681
+ {
682
+ "epoch": 2.161547212741752,
683
+ "grad_norm": 6.144412040710449,
684
+ "learning_rate": 1.4221004951016539e-05,
685
+ "loss": 0.6163,
686
+ "step": 3800
687
+ },
688
+ {
689
+ "epoch": 2.161547212741752,
690
+ "eval_loss": 0.9836474061012268,
691
+ "eval_runtime": 28.9354,
692
+ "eval_samples_per_second": 863.993,
693
+ "eval_steps_per_second": 6.774,
694
+ "step": 3800
695
+ },
696
+ {
697
+ "epoch": 2.189988623435722,
698
+ "grad_norm": 5.410032272338867,
699
+ "learning_rate": 1.4115664173601602e-05,
700
+ "loss": 0.6271,
701
+ "step": 3850
702
+ },
703
+ {
704
+ "epoch": 2.218430034129693,
705
+ "grad_norm": 5.688889980316162,
706
+ "learning_rate": 1.4010323396186664e-05,
707
+ "loss": 0.629,
708
+ "step": 3900
709
+ },
710
+ {
711
+ "epoch": 2.2468714448236633,
712
+ "grad_norm": 5.400741100311279,
713
+ "learning_rate": 1.3904982618771728e-05,
714
+ "loss": 0.6041,
715
+ "step": 3950
716
+ },
717
+ {
718
+ "epoch": 2.2753128555176336,
719
+ "grad_norm": 6.409387111663818,
720
+ "learning_rate": 1.379964184135679e-05,
721
+ "loss": 0.622,
722
+ "step": 4000
723
+ },
724
+ {
725
+ "epoch": 2.2753128555176336,
726
+ "eval_loss": 0.9791940450668335,
727
+ "eval_runtime": 29.3397,
728
+ "eval_samples_per_second": 852.088,
729
+ "eval_steps_per_second": 6.68,
730
+ "step": 4000
731
+ },
732
+ {
733
+ "epoch": 2.303754266211604,
734
+ "grad_norm": 5.827444076538086,
735
+ "learning_rate": 1.3694301063941853e-05,
736
+ "loss": 0.6175,
737
+ "step": 4050
738
+ },
739
+ {
740
+ "epoch": 2.3321956769055747,
741
+ "grad_norm": 6.436943054199219,
742
+ "learning_rate": 1.3588960286526916e-05,
743
+ "loss": 0.627,
744
+ "step": 4100
745
+ },
746
+ {
747
+ "epoch": 2.360637087599545,
748
+ "grad_norm": 5.842226028442383,
749
+ "learning_rate": 1.3483619509111978e-05,
750
+ "loss": 0.6339,
751
+ "step": 4150
752
+ },
753
+ {
754
+ "epoch": 2.3890784982935154,
755
+ "grad_norm": 6.457271575927734,
756
+ "learning_rate": 1.3378278731697042e-05,
757
+ "loss": 0.6325,
758
+ "step": 4200
759
+ },
760
+ {
761
+ "epoch": 2.3890784982935154,
762
+ "eval_loss": 0.9643296003341675,
763
+ "eval_runtime": 28.9755,
764
+ "eval_samples_per_second": 862.799,
765
+ "eval_steps_per_second": 6.764,
766
+ "step": 4200
767
+ },
768
+ {
769
+ "epoch": 2.4175199089874857,
770
+ "grad_norm": 6.070743560791016,
771
+ "learning_rate": 1.3272937954282103e-05,
772
+ "loss": 0.6044,
773
+ "step": 4250
774
+ },
775
+ {
776
+ "epoch": 2.445961319681456,
777
+ "grad_norm": 6.5427565574646,
778
+ "learning_rate": 1.3167597176867167e-05,
779
+ "loss": 0.6124,
780
+ "step": 4300
781
+ },
782
+ {
783
+ "epoch": 2.474402730375427,
784
+ "grad_norm": 5.342416286468506,
785
+ "learning_rate": 1.3062256399452229e-05,
786
+ "loss": 0.6326,
787
+ "step": 4350
788
+ },
789
+ {
790
+ "epoch": 2.502844141069397,
791
+ "grad_norm": 5.6298041343688965,
792
+ "learning_rate": 1.2956915622037292e-05,
793
+ "loss": 0.6349,
794
+ "step": 4400
795
+ },
796
+ {
797
+ "epoch": 2.502844141069397,
798
+ "eval_loss": 0.9462358355522156,
799
+ "eval_runtime": 29.0573,
800
+ "eval_samples_per_second": 860.369,
801
+ "eval_steps_per_second": 6.745,
802
+ "step": 4400
803
+ },
804
+ {
805
+ "epoch": 2.5312855517633674,
806
+ "grad_norm": 5.618624210357666,
807
+ "learning_rate": 1.2851574844622354e-05,
808
+ "loss": 0.6286,
809
+ "step": 4450
810
+ },
811
+ {
812
+ "epoch": 2.5597269624573378,
813
+ "grad_norm": 5.629756927490234,
814
+ "learning_rate": 1.2746234067207417e-05,
815
+ "loss": 0.6325,
816
+ "step": 4500
817
+ },
818
+ {
819
+ "epoch": 2.5881683731513085,
820
+ "grad_norm": 5.6407318115234375,
821
+ "learning_rate": 1.2640893289792479e-05,
822
+ "loss": 0.6399,
823
+ "step": 4550
824
+ },
825
+ {
826
+ "epoch": 2.616609783845279,
827
+ "grad_norm": 6.080498695373535,
828
+ "learning_rate": 1.2535552512377542e-05,
829
+ "loss": 0.6184,
830
+ "step": 4600
831
+ },
832
+ {
833
+ "epoch": 2.616609783845279,
834
+ "eval_loss": 0.9317007064819336,
835
+ "eval_runtime": 29.0538,
836
+ "eval_samples_per_second": 860.472,
837
+ "eval_steps_per_second": 6.746,
838
+ "step": 4600
839
+ },
840
+ {
841
+ "epoch": 2.645051194539249,
842
+ "grad_norm": 6.4962239265441895,
843
+ "learning_rate": 1.2430211734962604e-05,
844
+ "loss": 0.6292,
845
+ "step": 4650
846
+ },
847
+ {
848
+ "epoch": 2.6734926052332195,
849
+ "grad_norm": 6.621969223022461,
850
+ "learning_rate": 1.2324870957547668e-05,
851
+ "loss": 0.6017,
852
+ "step": 4700
853
+ },
854
+ {
855
+ "epoch": 2.70193401592719,
856
+ "grad_norm": 5.2126054763793945,
857
+ "learning_rate": 1.2219530180132731e-05,
858
+ "loss": 0.6305,
859
+ "step": 4750
860
+ },
861
+ {
862
+ "epoch": 2.73037542662116,
863
+ "grad_norm": 6.410334587097168,
864
+ "learning_rate": 1.2114189402717793e-05,
865
+ "loss": 0.6152,
866
+ "step": 4800
867
+ },
868
+ {
869
+ "epoch": 2.73037542662116,
870
+ "eval_loss": 0.9212636947631836,
871
+ "eval_runtime": 29.0224,
872
+ "eval_samples_per_second": 861.404,
873
+ "eval_steps_per_second": 6.753,
874
+ "step": 4800
875
+ },
876
+ {
877
+ "epoch": 2.758816837315131,
878
+ "grad_norm": 6.005552291870117,
879
+ "learning_rate": 1.2008848625302856e-05,
880
+ "loss": 0.5972,
881
+ "step": 4850
882
+ },
883
+ {
884
+ "epoch": 2.7872582480091013,
885
+ "grad_norm": 6.479732990264893,
886
+ "learning_rate": 1.1903507847887918e-05,
887
+ "loss": 0.6048,
888
+ "step": 4900
889
+ },
890
+ {
891
+ "epoch": 2.8156996587030716,
892
+ "grad_norm": 6.2526397705078125,
893
+ "learning_rate": 1.1798167070472982e-05,
894
+ "loss": 0.6096,
895
+ "step": 4950
896
+ },
897
+ {
898
+ "epoch": 2.8441410693970424,
899
+ "grad_norm": 6.823054313659668,
900
+ "learning_rate": 1.1692826293058043e-05,
901
+ "loss": 0.6156,
902
+ "step": 5000
903
+ },
904
+ {
905
+ "epoch": 2.8441410693970424,
906
+ "eval_loss": 0.9072502851486206,
907
+ "eval_runtime": 29.0918,
908
+ "eval_samples_per_second": 859.348,
909
+ "eval_steps_per_second": 6.737,
910
+ "step": 5000
911
+ },
912
+ {
913
+ "epoch": 2.8725824800910127,
914
+ "grad_norm": 5.63970422744751,
915
+ "learning_rate": 1.1587485515643107e-05,
916
+ "loss": 0.5942,
917
+ "step": 5050
918
+ },
919
+ {
920
+ "epoch": 2.901023890784983,
921
+ "grad_norm": 5.7269182205200195,
922
+ "learning_rate": 1.1482144738228169e-05,
923
+ "loss": 0.592,
924
+ "step": 5100
925
+ },
926
+ {
927
+ "epoch": 2.9294653014789533,
928
+ "grad_norm": 6.235472202301025,
929
+ "learning_rate": 1.1376803960813232e-05,
930
+ "loss": 0.6088,
931
+ "step": 5150
932
+ },
933
+ {
934
+ "epoch": 2.9579067121729237,
935
+ "grad_norm": 6.49041748046875,
936
+ "learning_rate": 1.1271463183398294e-05,
937
+ "loss": 0.5941,
938
+ "step": 5200
939
+ },
940
+ {
941
+ "epoch": 2.9579067121729237,
942
+ "eval_loss": 0.8950417041778564,
943
+ "eval_runtime": 29.0632,
944
+ "eval_samples_per_second": 860.195,
945
+ "eval_steps_per_second": 6.744,
946
+ "step": 5200
947
+ },
948
+ {
949
+ "epoch": 2.986348122866894,
950
+ "grad_norm": 6.089723587036133,
951
+ "learning_rate": 1.1166122405983357e-05,
952
+ "loss": 0.6161,
953
+ "step": 5250
954
+ },
955
+ {
956
+ "epoch": 3.0147895335608648,
957
+ "grad_norm": 4.977637767791748,
958
+ "learning_rate": 1.1060781628568419e-05,
959
+ "loss": 0.5021,
960
+ "step": 5300
961
+ },
962
+ {
963
+ "epoch": 3.043230944254835,
964
+ "grad_norm": 5.729337215423584,
965
+ "learning_rate": 1.0955440851153483e-05,
966
+ "loss": 0.4116,
967
+ "step": 5350
968
+ },
969
+ {
970
+ "epoch": 3.0716723549488054,
971
+ "grad_norm": 4.303124904632568,
972
+ "learning_rate": 1.0850100073738546e-05,
973
+ "loss": 0.3936,
974
+ "step": 5400
975
+ },
976
+ {
977
+ "epoch": 3.0716723549488054,
978
+ "eval_loss": 0.9009103775024414,
979
+ "eval_runtime": 28.839,
980
+ "eval_samples_per_second": 866.881,
981
+ "eval_steps_per_second": 6.796,
982
+ "step": 5400
983
+ },
984
+ {
985
+ "epoch": 3.1001137656427757,
986
+ "grad_norm": 5.400048732757568,
987
+ "learning_rate": 1.0744759296323608e-05,
988
+ "loss": 0.4193,
989
+ "step": 5450
990
+ },
991
+ {
992
+ "epoch": 3.1285551763367465,
993
+ "grad_norm": 6.018354415893555,
994
+ "learning_rate": 1.0639418518908671e-05,
995
+ "loss": 0.422,
996
+ "step": 5500
997
+ },
998
+ {
999
+ "epoch": 3.156996587030717,
1000
+ "grad_norm": 5.685466766357422,
1001
+ "learning_rate": 1.0534077741493733e-05,
1002
+ "loss": 0.432,
1003
+ "step": 5550
1004
+ },
1005
+ {
1006
+ "epoch": 3.185437997724687,
1007
+ "grad_norm": 5.172823905944824,
1008
+ "learning_rate": 1.0428736964078797e-05,
1009
+ "loss": 0.4281,
1010
+ "step": 5600
1011
+ },
1012
+ {
1013
+ "epoch": 3.185437997724687,
1014
+ "eval_loss": 0.8985010981559753,
1015
+ "eval_runtime": 28.8596,
1016
+ "eval_samples_per_second": 866.262,
1017
+ "eval_steps_per_second": 6.791,
1018
+ "step": 5600
1019
+ },
1020
+ {
1021
+ "epoch": 3.2138794084186575,
1022
+ "grad_norm": 4.836643218994141,
1023
+ "learning_rate": 1.0323396186663858e-05,
1024
+ "loss": 0.4091,
1025
+ "step": 5650
1026
+ },
1027
+ {
1028
+ "epoch": 3.242320819112628,
1029
+ "grad_norm": 5.528740406036377,
1030
+ "learning_rate": 1.0218055409248922e-05,
1031
+ "loss": 0.4305,
1032
+ "step": 5700
1033
+ },
1034
+ {
1035
+ "epoch": 3.2707622298065986,
1036
+ "grad_norm": 4.45158576965332,
1037
+ "learning_rate": 1.0112714631833984e-05,
1038
+ "loss": 0.4203,
1039
+ "step": 5750
1040
+ },
1041
+ {
1042
+ "epoch": 3.299203640500569,
1043
+ "grad_norm": 6.183067798614502,
1044
+ "learning_rate": 1.0007373854419047e-05,
1045
+ "loss": 0.4193,
1046
+ "step": 5800
1047
+ },
1048
+ {
1049
+ "epoch": 3.299203640500569,
1050
+ "eval_loss": 0.8869061470031738,
1051
+ "eval_runtime": 28.6962,
1052
+ "eval_samples_per_second": 871.197,
1053
+ "eval_steps_per_second": 6.83,
1054
+ "step": 5800
1055
+ },
1056
+ {
1057
+ "epoch": 3.3276450511945392,
1058
+ "grad_norm": 5.19403600692749,
1059
+ "learning_rate": 9.902033077004109e-06,
1060
+ "loss": 0.4238,
1061
+ "step": 5850
1062
+ },
1063
+ {
1064
+ "epoch": 3.3560864618885096,
1065
+ "grad_norm": 5.304056644439697,
1066
+ "learning_rate": 9.796692299589172e-06,
1067
+ "loss": 0.4274,
1068
+ "step": 5900
1069
+ },
1070
+ {
1071
+ "epoch": 3.3845278725824803,
1072
+ "grad_norm": 4.698873519897461,
1073
+ "learning_rate": 9.691351522174236e-06,
1074
+ "loss": 0.4124,
1075
+ "step": 5950
1076
+ },
1077
+ {
1078
+ "epoch": 3.4129692832764507,
1079
+ "grad_norm": 5.627292156219482,
1080
+ "learning_rate": 9.586010744759297e-06,
1081
+ "loss": 0.4241,
1082
+ "step": 6000
1083
+ },
1084
+ {
1085
+ "epoch": 3.4129692832764507,
1086
+ "eval_loss": 0.8842443823814392,
1087
+ "eval_runtime": 28.6817,
1088
+ "eval_samples_per_second": 871.636,
1089
+ "eval_steps_per_second": 6.834,
1090
+ "step": 6000
1091
+ },
+ { "epoch": 3.441410693970421, "grad_norm": 6.473363876342773, "learning_rate": 9.480669967344361e-06, "loss": 0.427, "step": 6050 },
+ { "epoch": 3.4698521046643913, "grad_norm": 4.9653801918029785, "learning_rate": 9.375329189929423e-06, "loss": 0.4275, "step": 6100 },
+ { "epoch": 3.4982935153583616, "grad_norm": 4.9852294921875, "learning_rate": 9.269988412514486e-06, "loss": 0.4152, "step": 6150 },
+ { "epoch": 3.526734926052332, "grad_norm": 5.868428707122803, "learning_rate": 9.164647635099548e-06, "loss": 0.4247, "step": 6200 },
+ { "epoch": 3.526734926052332, "eval_loss": 0.8732792139053345, "eval_runtime": 28.8814, "eval_samples_per_second": 865.608, "eval_steps_per_second": 6.786, "step": 6200 },
+ { "epoch": 3.5551763367463027, "grad_norm": 5.333588600158691, "learning_rate": 9.05930685768461e-06, "loss": 0.4111, "step": 6250 },
+ { "epoch": 3.583617747440273, "grad_norm": 5.569532871246338, "learning_rate": 8.953966080269673e-06, "loss": 0.4396, "step": 6300 },
+ { "epoch": 3.6120591581342434, "grad_norm": 5.38419771194458, "learning_rate": 8.848625302854735e-06, "loss": 0.4122, "step": 6350 },
+ { "epoch": 3.640500568828214, "grad_norm": 5.328497409820557, "learning_rate": 8.743284525439798e-06, "loss": 0.4252, "step": 6400 },
+ { "epoch": 3.640500568828214, "eval_loss": 0.8656958937644958, "eval_runtime": 28.751, "eval_samples_per_second": 869.534, "eval_steps_per_second": 6.817, "step": 6400 },
+ { "epoch": 3.6689419795221845, "grad_norm": 5.675217151641846, "learning_rate": 8.63794374802486e-06, "loss": 0.4167, "step": 6450 },
+ { "epoch": 3.697383390216155, "grad_norm": 5.26973295211792, "learning_rate": 8.532602970609924e-06, "loss": 0.4282, "step": 6500 },
+ { "epoch": 3.725824800910125, "grad_norm": 5.991490840911865, "learning_rate": 8.427262193194985e-06, "loss": 0.411, "step": 6550 },
+ { "epoch": 3.7542662116040955, "grad_norm": 5.413957118988037, "learning_rate": 8.321921415780049e-06, "loss": 0.4273, "step": 6600 },
+ { "epoch": 3.7542662116040955, "eval_loss": 0.8539847135543823, "eval_runtime": 28.8669, "eval_samples_per_second": 866.045, "eval_steps_per_second": 6.79, "step": 6600 },
+ { "epoch": 3.782707622298066, "grad_norm": 5.672956466674805, "learning_rate": 8.21658063836511e-06, "loss": 0.4327, "step": 6650 },
+ { "epoch": 3.8111490329920366, "grad_norm": 6.0553059577941895, "learning_rate": 8.111239860950174e-06, "loss": 0.431, "step": 6700 },
+ { "epoch": 3.839590443686007, "grad_norm": 6.111351013183594, "learning_rate": 8.005899083535238e-06, "loss": 0.4347, "step": 6750 },
+ { "epoch": 3.868031854379977, "grad_norm": 6.185035705566406, "learning_rate": 7.9005583061203e-06, "loss": 0.4264, "step": 6800 },
+ { "epoch": 3.868031854379977, "eval_loss": 0.8523036241531372, "eval_runtime": 28.7415, "eval_samples_per_second": 869.823, "eval_steps_per_second": 6.819, "step": 6800 },
+ { "epoch": 3.8964732650739475, "grad_norm": 4.952618598937988, "learning_rate": 7.795217528705363e-06, "loss": 0.4213, "step": 6850 },
+ { "epoch": 3.9249146757679183, "grad_norm": 5.168086528778076, "learning_rate": 7.689876751290425e-06, "loss": 0.4285, "step": 6900 },
+ { "epoch": 3.9533560864618886, "grad_norm": 5.6217732429504395, "learning_rate": 7.584535973875487e-06, "loss": 0.4138, "step": 6950 },
+ { "epoch": 3.981797497155859, "grad_norm": 4.983550548553467, "learning_rate": 7.47919519646055e-06, "loss": 0.4051, "step": 7000 },
+ { "epoch": 3.981797497155859, "eval_loss": 0.8406953811645508, "eval_runtime": 28.8132, "eval_samples_per_second": 867.659, "eval_steps_per_second": 6.802, "step": 7000 },
+ { "epoch": 4.010238907849829, "grad_norm": 3.829274892807007, "learning_rate": 7.373854419045613e-06, "loss": 0.3779, "step": 7050 },
+ { "epoch": 4.0386803185438, "grad_norm": 4.154295921325684, "learning_rate": 7.268513641630676e-06, "loss": 0.2957, "step": 7100 },
+ { "epoch": 4.06712172923777, "grad_norm": 5.0097222328186035, "learning_rate": 7.1631728642157386e-06, "loss": 0.2939, "step": 7150 },
+ { "epoch": 4.09556313993174, "grad_norm": 5.015048027038574, "learning_rate": 7.057832086800801e-06, "loss": 0.3065, "step": 7200 },
+ { "epoch": 4.09556313993174, "eval_loss": 0.8590184450149536, "eval_runtime": 28.7607, "eval_samples_per_second": 869.241, "eval_steps_per_second": 6.815, "step": 7200 },
+ { "epoch": 4.1240045506257115, "grad_norm": 4.9901018142700195, "learning_rate": 6.952491309385864e-06, "loss": 0.3081, "step": 7250 },
+ { "epoch": 4.152445961319682, "grad_norm": 4.8424391746521, "learning_rate": 6.847150531970926e-06, "loss": 0.3043, "step": 7300 },
+ { "epoch": 4.180887372013652, "grad_norm": 5.147951602935791, "learning_rate": 6.741809754555989e-06, "loss": 0.3176, "step": 7350 },
+ { "epoch": 4.2093287827076225, "grad_norm": 4.292293548583984, "learning_rate": 6.636468977141052e-06, "loss": 0.3067, "step": 7400 },
+ { "epoch": 4.2093287827076225, "eval_loss": 0.848746657371521, "eval_runtime": 29.0524, "eval_samples_per_second": 860.514, "eval_steps_per_second": 6.746, "step": 7400 },
+ { "epoch": 4.237770193401593, "grad_norm": 4.796692848205566, "learning_rate": 6.531128199726114e-06, "loss": 0.299, "step": 7450 },
+ { "epoch": 4.266211604095563, "grad_norm": 5.196813583374023, "learning_rate": 6.425787422311177e-06, "loss": 0.3106, "step": 7500 },
+ { "epoch": 4.294653014789533, "grad_norm": 4.551479816436768, "learning_rate": 6.3204466448962395e-06, "loss": 0.3062, "step": 7550 },
+ { "epoch": 4.323094425483504, "grad_norm": 4.6921257972717285, "learning_rate": 6.215105867481302e-06, "loss": 0.3153, "step": 7600 },
+ { "epoch": 4.323094425483504, "eval_loss": 0.8497870564460754, "eval_runtime": 29.0027, "eval_samples_per_second": 861.988, "eval_steps_per_second": 6.758, "step": 7600 },
+ { "epoch": 4.351535836177474, "grad_norm": 4.535303592681885, "learning_rate": 6.109765090066366e-06, "loss": 0.3206, "step": 7650 },
+ { "epoch": 4.379977246871444, "grad_norm": 5.174567222595215, "learning_rate": 6.004424312651428e-06, "loss": 0.3202, "step": 7700 },
+ { "epoch": 4.408418657565416, "grad_norm": 4.402812480926514, "learning_rate": 5.899083535236491e-06, "loss": 0.3167, "step": 7750 },
+ { "epoch": 4.436860068259386, "grad_norm": 4.917297840118408, "learning_rate": 5.7937427578215534e-06, "loss": 0.3044, "step": 7800 },
+ { "epoch": 4.436860068259386, "eval_loss": 0.8426228165626526, "eval_runtime": 29.2233, "eval_samples_per_second": 855.482, "eval_steps_per_second": 6.707, "step": 7800 },
+ { "epoch": 4.465301478953356, "grad_norm": 5.476150989532471, "learning_rate": 5.688401980406616e-06, "loss": 0.3015, "step": 7850 },
+ { "epoch": 4.493742889647327, "grad_norm": 5.594091415405273, "learning_rate": 5.583061202991679e-06, "loss": 0.3157, "step": 7900 },
+ { "epoch": 4.522184300341297, "grad_norm": 4.798509120941162, "learning_rate": 5.477720425576741e-06, "loss": 0.3109, "step": 7950 },
+ { "epoch": 4.550625711035267, "grad_norm": 4.705766201019287, "learning_rate": 5.372379648161804e-06, "loss": 0.3164, "step": 8000 },
+ { "epoch": 4.550625711035267, "eval_loss": 0.8384647369384766, "eval_runtime": 29.0223, "eval_samples_per_second": 861.406, "eval_steps_per_second": 6.753, "step": 8000 },
+ { "epoch": 4.579067121729238, "grad_norm": 5.214234352111816, "learning_rate": 5.269145686295165e-06, "loss": 0.2996, "step": 8050 },
+ { "epoch": 4.607508532423208, "grad_norm": 3.9629294872283936, "learning_rate": 5.163804908880228e-06, "loss": 0.3247, "step": 8100 },
+ { "epoch": 4.635949943117178, "grad_norm": 5.35923957824707, "learning_rate": 5.058464131465291e-06, "loss": 0.3093, "step": 8150 },
+ { "epoch": 4.664391353811149, "grad_norm": 4.924727916717529, "learning_rate": 4.9531233540503534e-06, "loss": 0.3017, "step": 8200 },
+ { "epoch": 4.664391353811149, "eval_loss": 0.8293972611427307, "eval_runtime": 29.0332, "eval_samples_per_second": 861.084, "eval_steps_per_second": 6.751, "step": 8200 },
+ { "epoch": 4.69283276450512, "grad_norm": 4.929891586303711, "learning_rate": 4.847782576635416e-06, "loss": 0.3075, "step": 8250 },
+ { "epoch": 4.72127417519909, "grad_norm": 4.345849514007568, "learning_rate": 4.742441799220479e-06, "loss": 0.3006, "step": 8300 },
+ { "epoch": 4.74971558589306, "grad_norm": 4.58878231048584, "learning_rate": 4.637101021805541e-06, "loss": 0.3134, "step": 8350 },
+ { "epoch": 4.778156996587031, "grad_norm": 5.448882579803467, "learning_rate": 4.531760244390604e-06, "loss": 0.3111, "step": 8400 },
+ { "epoch": 4.778156996587031, "eval_loss": 0.8249350786209106, "eval_runtime": 29.1624, "eval_samples_per_second": 857.269, "eval_steps_per_second": 6.721, "step": 8400 },
+ { "epoch": 4.806598407281001, "grad_norm": 4.381404399871826, "learning_rate": 4.4264194669756665e-06, "loss": 0.3165, "step": 8450 },
+ { "epoch": 4.835039817974971, "grad_norm": 4.86619234085083, "learning_rate": 4.321078689560729e-06, "loss": 0.3071, "step": 8500 },
+ { "epoch": 4.863481228668942, "grad_norm": 5.313292503356934, "learning_rate": 4.215737912145792e-06, "loss": 0.3017, "step": 8550 },
+ { "epoch": 4.891922639362912, "grad_norm": 4.802574157714844, "learning_rate": 4.110397134730854e-06, "loss": 0.3092, "step": 8600 },
+ { "epoch": 4.891922639362912, "eval_loss": 0.8224520087242126, "eval_runtime": 29.0511, "eval_samples_per_second": 860.551, "eval_steps_per_second": 6.747, "step": 8600 },
+ { "epoch": 4.920364050056882, "grad_norm": 5.428598880767822, "learning_rate": 4.005056357315917e-06, "loss": 0.3, "step": 8650 },
+ { "epoch": 4.948805460750854, "grad_norm": 5.6783528327941895, "learning_rate": 3.8997155799009805e-06, "loss": 0.2999, "step": 8700 },
+ { "epoch": 4.977246871444824, "grad_norm": 5.2957940101623535, "learning_rate": 3.7943748024860427e-06, "loss": 0.3116, "step": 8750 },
+ { "epoch": 5.005688282138794, "grad_norm": 4.1276631355285645, "learning_rate": 3.6890340250711053e-06, "loss": 0.3046, "step": 8800 },
+ { "epoch": 5.005688282138794, "eval_loss": 0.8173409700393677, "eval_runtime": 28.9634, "eval_samples_per_second": 863.157, "eval_steps_per_second": 6.767, "step": 8800 },
+ { "epoch": 5.034129692832765, "grad_norm": 4.093660354614258, "learning_rate": 3.5836932476561683e-06, "loss": 0.2501, "step": 8850 },
+ { "epoch": 5.062571103526735, "grad_norm": 5.549435615539551, "learning_rate": 3.478352470241231e-06, "loss": 0.2443, "step": 8900 },
+ { "epoch": 5.091012514220705, "grad_norm": 4.558211803436279, "learning_rate": 3.3730116928262936e-06, "loss": 0.2338, "step": 8950 },
+ { "epoch": 5.1194539249146755, "grad_norm": 3.450760841369629, "learning_rate": 3.267670915411356e-06, "loss": 0.2382, "step": 9000 },
+ { "epoch": 5.1194539249146755, "eval_loss": 0.8248207569122314, "eval_runtime": 29.0514, "eval_samples_per_second": 860.545, "eval_steps_per_second": 6.747, "step": 9000 },
+ { "epoch": 5.147895335608646, "grad_norm": 4.0541205406188965, "learning_rate": 3.162330137996419e-06, "loss": 0.2524, "step": 9050 },
+ { "epoch": 5.176336746302616, "grad_norm": 4.376137733459473, "learning_rate": 3.0569893605814814e-06, "loss": 0.2427, "step": 9100 },
+ { "epoch": 5.204778156996587, "grad_norm": 4.169808864593506, "learning_rate": 2.951648583166544e-06, "loss": 0.2512, "step": 9150 },
+ { "epoch": 5.233219567690558, "grad_norm": 4.089740753173828, "learning_rate": 2.846307805751607e-06, "loss": 0.2377, "step": 9200 },
+ { "epoch": 5.233219567690558, "eval_loss": 0.8218184113502502, "eval_runtime": 28.9027, "eval_samples_per_second": 864.97, "eval_steps_per_second": 6.781, "step": 9200 },
+ { "epoch": 5.261660978384528, "grad_norm": 4.028066635131836, "learning_rate": 2.7409670283366697e-06, "loss": 0.2458, "step": 9250 },
+ { "epoch": 5.290102389078498, "grad_norm": 5.62259578704834, "learning_rate": 2.635626250921732e-06, "loss": 0.2515, "step": 9300 },
+ { "epoch": 5.318543799772469, "grad_norm": 4.931870937347412, "learning_rate": 2.5302854735067945e-06, "loss": 0.2453, "step": 9350 },
+ { "epoch": 5.346985210466439, "grad_norm": 4.307934284210205, "learning_rate": 2.4249446960918575e-06, "loss": 0.244, "step": 9400 },
+ { "epoch": 5.346985210466439, "eval_loss": 0.8225930333137512, "eval_runtime": 28.8011, "eval_samples_per_second": 868.022, "eval_steps_per_second": 6.805, "step": 9400 },
+ { "epoch": 5.375426621160409, "grad_norm": 3.650233030319214, "learning_rate": 2.31960391867692e-06, "loss": 0.2389, "step": 9450 },
+ { "epoch": 5.40386803185438, "grad_norm": 4.171177864074707, "learning_rate": 2.2142631412619828e-06, "loss": 0.253, "step": 9500 },
+ { "epoch": 5.43230944254835, "grad_norm": 5.055683135986328, "learning_rate": 2.1089223638470454e-06, "loss": 0.2509, "step": 9550 },
+ { "epoch": 5.460750853242321, "grad_norm": 4.621593952178955, "learning_rate": 2.003581586432108e-06, "loss": 0.2492, "step": 9600 },
+ { "epoch": 5.460750853242321, "eval_loss": 0.8198309540748596, "eval_runtime": 28.7042, "eval_samples_per_second": 870.954, "eval_steps_per_second": 6.828, "step": 9600 },
+ { "epoch": 5.489192263936292, "grad_norm": 5.461741924285889, "learning_rate": 1.8982408090171708e-06, "loss": 0.2379, "step": 9650 },
+ { "epoch": 5.517633674630262, "grad_norm": 4.083144664764404, "learning_rate": 1.7929000316022333e-06, "loss": 0.247, "step": 9700 },
+ { "epoch": 5.546075085324232, "grad_norm": 4.508319854736328, "learning_rate": 1.6875592541872959e-06, "loss": 0.2419, "step": 9750 },
+ { "epoch": 5.5745164960182025, "grad_norm": 4.420298099517822, "learning_rate": 1.5822184767723587e-06, "loss": 0.244, "step": 9800 },
+ { "epoch": 5.5745164960182025, "eval_loss": 0.8149560689926147, "eval_runtime": 28.7025, "eval_samples_per_second": 871.004, "eval_steps_per_second": 6.829, "step": 9800 },
+ { "epoch": 5.602957906712173, "grad_norm": 4.702558517456055, "learning_rate": 1.4768776993574213e-06, "loss": 0.2498, "step": 9850 },
+ { "epoch": 5.631399317406143, "grad_norm": 3.864471912384033, "learning_rate": 1.371536921942484e-06, "loss": 0.2381, "step": 9900 },
+ { "epoch": 5.6598407281001135, "grad_norm": 4.41420316696167, "learning_rate": 1.2661961445275468e-06, "loss": 0.2425, "step": 9950 },
+ { "epoch": 5.688282138794084, "grad_norm": 4.402945041656494, "learning_rate": 1.1608553671126094e-06, "loss": 0.2451, "step": 10000 },
+ { "epoch": 5.688282138794084, "eval_loss": 0.8147642016410828, "eval_runtime": 28.751, "eval_samples_per_second": 869.534, "eval_steps_per_second": 6.817, "step": 10000 },
+ { "epoch": 5.716723549488055, "grad_norm": 4.66687536239624, "learning_rate": 1.055514589697672e-06, "loss": 0.2468, "step": 10050 },
+ { "epoch": 5.745164960182025, "grad_norm": 4.6121649742126465, "learning_rate": 9.501738122827347e-07, "loss": 0.2404, "step": 10100 },
+ { "epoch": 5.773606370875996, "grad_norm": 4.210214614868164, "learning_rate": 8.469398504160961e-07, "loss": 0.2397, "step": 10150 },
+ { "epoch": 5.802047781569966, "grad_norm": 4.265695095062256, "learning_rate": 7.415990730011588e-07, "loss": 0.2417, "step": 10200 },
+ { "epoch": 5.802047781569966, "eval_loss": 0.8124446868896484, "eval_runtime": 28.7474, "eval_samples_per_second": 869.643, "eval_steps_per_second": 6.818, "step": 10200 },
+ { "epoch": 5.830489192263936, "grad_norm": 4.166738033294678, "learning_rate": 6.362582955862215e-07, "loss": 0.2446, "step": 10250 },
+ { "epoch": 5.858930602957907, "grad_norm": 4.40815544128418, "learning_rate": 5.309175181712841e-07, "loss": 0.2443, "step": 10300 },
+ { "epoch": 5.887372013651877, "grad_norm": 3.757612466812134, "learning_rate": 4.255767407563468e-07, "loss": 0.2465, "step": 10350 },
+ { "epoch": 5.915813424345847, "grad_norm": 5.059196472167969, "learning_rate": 3.202359633414095e-07, "loss": 0.2472, "step": 10400 },
+ { "epoch": 5.915813424345847, "eval_loss": 0.8121369481086731, "eval_runtime": 28.8178, "eval_samples_per_second": 867.521, "eval_steps_per_second": 6.801, "step": 10400 }
+ ],
+ "logging_steps": 50,
+ "max_steps": 10548,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 6,
+ "save_steps": 200,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": { "should_epoch_stop": false, "should_evaluate": false, "should_log": false, "should_save": true, "should_training_stop": false },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 128,
+ "trial_name": null,
+ "trial_params": null
+ }
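The list above is the tail of `trainer_state.json`'s `log_history`: training records carry `loss`/`grad_norm`/`learning_rate`, while evaluation records carry `eval_loss`. A minimal sketch of picking the lowest-loss evaluation step from such a log (the entries below are copied from the records above; with the full file you would `json.load` it and read its `"log_history"` key):

```python
# Entries in the shape of trainer_state.json's "log_history",
# copied from the last records in the diff above.
log_history = [
    {"epoch": 5.802047781569966, "eval_loss": 0.8124446868896484, "step": 10200},
    {"epoch": 5.915813424345847, "grad_norm": 5.059196472167969, "loss": 0.2472, "step": 10400},
    {"epoch": 5.915813424345847, "eval_loss": 0.8121369481086731, "step": 10400},
]

# Evaluation records are the ones that contain "eval_loss".
evals = [e for e in log_history if "eval_loss" in e]
best = min(evals, key=lambda e: e["eval_loss"])
print(f"best checkpoint: step {best['step']} (eval_loss {best['eval_loss']:.4f})")
# → best checkpoint: step 10400 (eval_loss 0.8121)
```

With `"save_steps": 200` matching the evaluation cadence, every evaluated step also corresponds to a saved checkpoint, so the step found this way can be loaded directly.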
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94f14bd101977dc9d8d5d74058c59de6994ee27ab85b7fdece6d94ff314f5cdc
+ size 6033
vocab.txt ADDED
The diff for this file is too large to render. See raw diff