pszemraj committed on
Commit
8716a81
·
verified ·
0 Parent(s):

Super-squash branch 'main' using huggingface_hub

.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,89 @@
---
library_name: transformers
language:
- en
license: apache-2.0
base_model: facebook/bart-large
tags:
- map-reduce
- summarization
datasets:
- pszemraj/summary-map-reduce-v1
pipeline_tag: text2text-generation
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/Sv7_-MM901qNkyHuBdTC_.png
---

# bart-large-summary-map-reduce

A text2text model that "map-reduces" the summaries of a chunked long document into a single consolidated summary.

An [explanation](https://github.com/pszemraj/textsum/wiki/consolidating-summaries) of this model's role as a post-processor for [textsum](https://github.com/pszemraj/textsum) (_or any other long-document summarization method similar to the below_):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/Sv7_-MM901qNkyHuBdTC_.png)

<small> modified flowchart from Google's blog [here](https://cloud.google.com/blog/products/ai-machine-learning/long-document-summarization-with-workflows-and-gemini-models) </small>
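
The flow in the chart above can be sketched in plain Python. This is a minimal, hypothetical skeleton: `summarize_chunk` stands in for any chunk-level summarizer (e.g. a textsum pipeline), and `reduce_summaries` marks the reduce step where the combined partial summaries would be handed to this model.

```python
# Minimal sketch of the map-reduce summarization flow (illustrative only).
# `summarize_chunk` and `reduce_summaries` are hypothetical stand-ins, not
# part of any library: in practice the former is a chunk-level summarizer
# and the latter feeds the joined partial summaries to this model.

def chunk_document(text: str, chunk_size: int = 100) -> list[str]:
    """Split a long document into fixed-size chunks (by character, for brevity)."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_chunk(chunk: str) -> str:
    """Stand-in for a per-chunk summarizer; here it simply truncates."""
    return chunk[:30]

def reduce_summaries(summaries: list[str]) -> str:
    """Reduce step: join partial summaries into one input string, which
    would then be passed to bart-large-summary-map-reduce."""
    return "\n".join(summaries)

doc = "x" * 250                                               # a "long" document
partials = [summarize_chunk(c) for c in chunk_document(doc)]  # map
combined = reduce_summaries(partials)                         # input to reduce
```

The point of the reduce step is that the chunk summaries are consolidated jointly, so redundancies and contradictions across chunks can be resolved, rather than simply concatenated.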

## Details

This model is a fine-tuned version of [facebook/bart-large](https://huggingface.co/facebook/bart-large) on the pszemraj/summary-map-reduce dataset.
It achieves the following results on the evaluation set:
- Loss: 0.7894
- Num Input Tokens Seen: 14258488

## Usage

> [!TIP]
> BART supports several speedups for inference on GPU, including [flash-attention2](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) and [torch SDPA](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).

An example of aggregating summaries from chunks of a long document:

```py
import torch
from transformers import pipeline

pipe = pipeline(
    "text2text-generation",
    model="pszemraj/bart-large-summary-map-reduce",
    device_map="auto",
)

# example input: concatenated chunk-level summaries of one document
text = """"Sangers on a Train" is a 1950 film about a train driver, Guy Haines, who discovers his wife, Miriam, has been murdered in Metcalf, Washington, DC. The film delves into the relationship between Guy and Anne Burton, focusing on Guy's desire for Anne to marry him.
"Screentalk" is a comedy about Anne Burton and her husband, Guy Haines, who are investigating the murder of their daughter, Miriam. The plot revolves around Anne's relationship with Bruno, who has been arrested for his wife's murder. In the second set, Guy and Anne meet at a tennis court in Washington, DC, where they plan to play against each other. Hennessy and Hammond investigate the crime scene, leading to Guy's arrest.
"The Announcer's Boom Forest Hills" is a tennis game between Guy Haines and Bruno Antony, with the score six-five. In the second set, Haines leads three games to four, but his opponent, Bernard Reynolds, attacks him in the third set. Meanwhile, Anne Hennessy and Barbara Hammond are preparing for dinner at the amusement park, where Guy has been waiting for hours. A police car arrives, followed by a taxi. The boatman and detectives follow Guy through the queue, leading to the conclusion that Guy was the man responsible for the accident."""

# a second example; note it overwrites the first (comment it out to run the one above)
text = """A computer implemented method of generating a syntactic object. The method includes the steps of providing a plurality of input data sets, each input data set comprising one or more words, wherein each word is associated with at least one non-adjacent second word; creating an exocentric relationship between the first and second words by applying a neo-ian event semantics to the input data in such a way that the neo-antagonistic effect results in the generation of the syntactic object; and storing the generated syntactic object for future use.
A method of learning and using language is disclosed. The method includes the steps of creating a lexicon of words, wherein each word in the lexicon has at least two possible states, selecting a set of one or more of the possible states of the lexicon to be used as a base state for a subsequent computational operation, and applying the computational operation to the base state to form a new output state.
A computer implemented method for changing a first workspace to a second workspace. The method includes the steps of creating a new workspace by merging the first workspace with the second workspace, wherein the merging is based on at least one of: an impenetrable condition; a constraint on movement; and a resource restriction.
The brain is constantly loosing neurons because you doesn't want all the junk around."""

# generate
if torch.cuda.is_available():
    torch.cuda.empty_cache()
res = pipe(
    text,
    max_new_tokens=512,  # increase up to 1024 if needed
    num_beams=4,
    early_stopping=True,
    truncation=True,
)
print(res[0]["generated_text"])
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- seed: 17868
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- optimizer: PAGED_ADAMW with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 3.0
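
As a quick sanity check on the hyperparameter list above, the effective (total) train batch size is the per-device batch size times the gradient accumulation steps, assuming single-device training:

```python
# Values copied from the hyperparameter list above; single-device training assumed.
train_batch_size = 4
gradient_accumulation_steps = 16

# Effective batch size per optimizer step:
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 64, matching total_train_batch_size above
```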
all_results.json ADDED
@@ -0,0 +1,15 @@
{
  "epoch": 2.9906542056074765,
  "eval_loss": 0.78944993019104,
  "eval_runtime": 0.9197,
  "eval_samples": 150,
  "eval_samples_per_second": 163.104,
  "eval_steps_per_second": 41.32,
  "num_input_tokens_seen": 14258488,
  "total_flos": 3.0175424769490944e+16,
  "train_loss": 0.895275863011678,
  "train_runtime": 860.318,
  "train_samples": 16692,
  "train_samples_per_second": 58.206,
  "train_steps_per_second": 0.907
}
config.json ADDED
@@ -0,0 +1,73 @@
{
  "_name_or_path": "facebook/bart-large",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": null,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size": null,
  "normalize_before": false,
  "num_beams": null,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "length_penalty": 1.0,
      "max_length": 128,
      "min_length": 12,
      "num_beams": 4
    },
    "summarization_cnn": {
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "num_beams": 4
    },
    "summarization_xsum": {
      "length_penalty": 1.0,
      "max_length": 62,
      "min_length": 11,
      "num_beams": 6
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.46.0.dev0",
  "use_cache": true,
  "vocab_size": 50304
}
eval_results.json ADDED
@@ -0,0 +1,9 @@
{
  "epoch": 2.9906542056074765,
  "eval_loss": 0.78944993019104,
  "eval_runtime": 0.9197,
  "eval_samples": 150,
  "eval_samples_per_second": 163.104,
  "eval_steps_per_second": 41.32,
  "num_input_tokens_seen": 14258488
}
generation_config.json ADDED
@@ -0,0 +1,13 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1,
  "transformers_version": "4.46.0.dev0"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:59614eeaadafdb8d5df62cf77270bca4d4dbd9a8c82e88e0957148ac7f682cfc
size 1625586896
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50264": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "mask_token": "<mask>",
  "model_max_length": 1024,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "BartTokenizer",
  "trim_offsets": true,
  "unk_token": "<unk>"
}
train_results.json ADDED
@@ -0,0 +1,10 @@
{
  "epoch": 2.9906542056074765,
  "num_input_tokens_seen": 14258488,
  "total_flos": 3.0175424769490944e+16,
  "train_loss": 0.895275863011678,
  "train_runtime": 860.318,
  "train_samples": 16692,
  "train_samples_per_second": 58.206,
  "train_steps_per_second": 0.907
}
trainer_state.json ADDED
@@ -0,0 +1,1354 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 2.9906542056074765,
5
+ "eval_steps": 100,
6
+ "global_step": 780,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.01917086029235562,
13
+ "grad_norm": 10.80891227722168,
14
+ "learning_rate": 1.282051282051282e-05,
15
+ "loss": 2.5867,
16
+ "num_input_tokens_seen": 85900,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.03834172058471124,
21
+ "grad_norm": 5.320851802825928,
22
+ "learning_rate": 2.564102564102564e-05,
23
+ "loss": 1.778,
24
+ "num_input_tokens_seen": 168784,
25
+ "step": 10
26
+ },
27
+ {
28
+ "epoch": 0.05751258087706686,
29
+ "grad_norm": 2.427093982696533,
30
+ "learning_rate": 3.846153846153846e-05,
31
+ "loss": 1.5006,
32
+ "num_input_tokens_seen": 262708,
33
+ "step": 15
34
+ },
35
+ {
36
+ "epoch": 0.07668344116942248,
37
+ "grad_norm": 2.143993854522705,
38
+ "learning_rate": 5.128205128205128e-05,
39
+ "loss": 1.3452,
40
+ "num_input_tokens_seen": 350572,
41
+ "step": 20
42
+ },
43
+ {
44
+ "epoch": 0.09585430146177809,
45
+ "grad_norm": 2.46453595161438,
46
+ "learning_rate": 6.410256410256412e-05,
47
+ "loss": 1.2953,
48
+ "num_input_tokens_seen": 441976,
49
+ "step": 25
50
+ },
51
+ {
52
+ "epoch": 0.11502516175413371,
53
+ "grad_norm": 2.382268190383911,
54
+ "learning_rate": 7.692307692307693e-05,
55
+ "loss": 1.3104,
56
+ "num_input_tokens_seen": 525052,
57
+ "step": 30
58
+ },
59
+ {
60
+ "epoch": 0.13419602204648934,
61
+ "grad_norm": 2.990694761276245,
62
+ "learning_rate": 8.974358974358975e-05,
63
+ "loss": 1.306,
64
+ "num_input_tokens_seen": 626524,
65
+ "step": 35
66
+ },
67
+ {
68
+ "epoch": 0.15336688233884496,
69
+ "grad_norm": 1.8448060750961304,
70
+ "learning_rate": 9.99995506314361e-05,
71
+ "loss": 1.2473,
72
+ "num_input_tokens_seen": 709868,
73
+ "step": 40
74
+ },
75
+ {
76
+ "epoch": 0.17253774263120059,
77
+ "grad_norm": 2.6468124389648438,
78
+ "learning_rate": 9.998382357979809e-05,
79
+ "loss": 1.2091,
80
+ "num_input_tokens_seen": 823312,
81
+ "step": 45
82
+ },
83
+ {
84
+ "epoch": 0.19170860292355618,
85
+ "grad_norm": 1.919690728187561,
86
+ "learning_rate": 9.994563617659665e-05,
87
+ "loss": 1.1985,
88
+ "num_input_tokens_seen": 915100,
89
+ "step": 50
90
+ },
91
+ {
92
+ "epoch": 0.2108794632159118,
93
+ "grad_norm": 2.13997483253479,
94
+ "learning_rate": 9.988500558143337e-05,
95
+ "loss": 1.1922,
96
+ "num_input_tokens_seen": 1014472,
97
+ "step": 55
98
+ },
99
+ {
100
+ "epoch": 0.23005032350826743,
101
+ "grad_norm": 1.7454973459243774,
102
+ "learning_rate": 9.980195903881232e-05,
103
+ "loss": 1.2273,
104
+ "num_input_tokens_seen": 1111288,
105
+ "step": 60
106
+ },
107
+ {
108
+ "epoch": 0.24922118380062305,
109
+ "grad_norm": 2.0489871501922607,
110
+ "learning_rate": 9.969653386589748e-05,
111
+ "loss": 1.1729,
112
+ "num_input_tokens_seen": 1193792,
113
+ "step": 65
114
+ },
115
+ {
116
+ "epoch": 0.2683920440929787,
117
+ "grad_norm": 1.851652979850769,
118
+ "learning_rate": 9.956877743574438e-05,
119
+ "loss": 1.188,
120
+ "num_input_tokens_seen": 1293416,
121
+ "step": 70
122
+ },
123
+ {
124
+ "epoch": 0.2875629043853343,
125
+ "grad_norm": 2.3986475467681885,
126
+ "learning_rate": 9.94187471560127e-05,
127
+ "loss": 1.1459,
128
+ "num_input_tokens_seen": 1383712,
129
+ "step": 75
130
+ },
131
+ {
132
+ "epoch": 0.3067337646776899,
133
+ "grad_norm": 1.6811550855636597,
134
+ "learning_rate": 9.924651044317017e-05,
135
+ "loss": 1.1561,
136
+ "num_input_tokens_seen": 1483596,
137
+ "step": 80
138
+ },
139
+ {
140
+ "epoch": 0.3259046249700455,
141
+ "grad_norm": 2.3249671459198,
142
+ "learning_rate": 9.90521446921987e-05,
143
+ "loss": 1.1224,
144
+ "num_input_tokens_seen": 1561392,
145
+ "step": 85
146
+ },
147
+ {
148
+ "epoch": 0.34507548526240117,
149
+ "grad_norm": 2.17244553565979,
150
+ "learning_rate": 9.883573724181683e-05,
151
+ "loss": 1.1283,
152
+ "num_input_tokens_seen": 1660352,
153
+ "step": 90
154
+ },
155
+ {
156
+ "epoch": 0.36424634555475677,
157
+ "grad_norm": 2.0246732234954834,
158
+ "learning_rate": 9.859738533523383e-05,
159
+ "loss": 1.155,
160
+ "num_input_tokens_seen": 1751904,
161
+ "step": 95
162
+ },
163
+ {
164
+ "epoch": 0.38341720584711236,
165
+ "grad_norm": 1.6448779106140137,
166
+ "learning_rate": 9.833719607645324e-05,
167
+ "loss": 1.0645,
168
+ "num_input_tokens_seen": 1844404,
169
+ "step": 100
170
+ },
171
+ {
172
+ "epoch": 0.38341720584711236,
173
+ "eval_loss": 0.9265391826629639,
174
+ "eval_runtime": 0.8899,
175
+ "eval_samples_per_second": 168.564,
176
+ "eval_steps_per_second": 42.703,
177
+ "num_input_tokens_seen": 1844404,
178
+ "step": 100
179
+ },
180
+ {
181
+ "epoch": 0.402588066139468,
182
+ "grad_norm": 3.241382598876953,
183
+ "learning_rate": 9.805528638214542e-05,
184
+ "loss": 1.1597,
185
+ "num_input_tokens_seen": 1930376,
186
+ "step": 105
187
+ },
188
+ {
189
+ "epoch": 0.4217589264318236,
190
+ "grad_norm": 1.7404630184173584,
191
+ "learning_rate": 9.77517829291108e-05,
192
+ "loss": 1.112,
193
+ "num_input_tokens_seen": 2014196,
194
+ "step": 110
195
+ },
196
+ {
197
+ "epoch": 0.44092978672417926,
198
+ "grad_norm": 2.0619559288024902,
199
+ "learning_rate": 9.742682209735727e-05,
200
+ "loss": 1.1078,
201
+ "num_input_tokens_seen": 2105080,
202
+ "step": 115
203
+ },
204
+ {
205
+ "epoch": 0.46010064701653486,
206
+ "grad_norm": 1.618997573852539,
207
+ "learning_rate": 9.708054990881763e-05,
208
+ "loss": 1.123,
209
+ "num_input_tokens_seen": 2212296,
210
+ "step": 120
211
+ },
212
+ {
213
+ "epoch": 0.4792715073088905,
214
+ "grad_norm": 2.0891501903533936,
215
+ "learning_rate": 9.671312196173412e-05,
216
+ "loss": 1.1139,
217
+ "num_input_tokens_seen": 2297740,
218
+ "step": 125
219
+ },
220
+ {
221
+ "epoch": 0.4984423676012461,
222
+ "grad_norm": 2.8104231357574463,
223
+ "learning_rate": 9.632470336074009e-05,
224
+ "loss": 1.1419,
225
+ "num_input_tokens_seen": 2393552,
226
+ "step": 130
227
+ },
228
+ {
229
+ "epoch": 0.5176132278936018,
230
+ "grad_norm": 2.313002586364746,
231
+ "learning_rate": 9.591546864266983e-05,
232
+ "loss": 1.1122,
233
+ "num_input_tokens_seen": 2485636,
234
+ "step": 135
235
+ },
236
+ {
237
+ "epoch": 0.5367840881859574,
238
+ "grad_norm": 1.6471989154815674,
239
+ "learning_rate": 9.548560169812997e-05,
240
+ "loss": 1.0872,
241
+ "num_input_tokens_seen": 2579380,
242
+ "step": 140
243
+ },
244
+ {
245
+ "epoch": 0.555954948478313,
246
+ "grad_norm": 1.8052936792373657,
247
+ "learning_rate": 9.50352956888678e-05,
248
+ "loss": 1.1084,
249
+ "num_input_tokens_seen": 2660580,
250
+ "step": 145
251
+ },
252
+ {
253
+ "epoch": 0.5751258087706685,
254
+ "grad_norm": 1.9831300973892212,
255
+ "learning_rate": 9.45647529609736e-05,
256
+ "loss": 1.0753,
257
+ "num_input_tokens_seen": 2752664,
258
+ "step": 150
259
+ },
260
+ {
261
+ "epoch": 0.5942966690630243,
262
+ "grad_norm": 1.9232532978057861,
263
+ "learning_rate": 9.4074184953956e-05,
264
+ "loss": 1.0932,
265
+ "num_input_tokens_seen": 2835612,
266
+ "step": 155
267
+ },
268
+ {
269
+ "epoch": 0.6134675293553798,
270
+ "grad_norm": 3.607057571411133,
271
+ "learning_rate": 9.356381210573091e-05,
272
+ "loss": 1.0856,
273
+ "num_input_tokens_seen": 2927884,
274
+ "step": 160
275
+ },
276
+ {
277
+ "epoch": 0.6326383896477354,
278
+ "grad_norm": 1.5196906328201294,
279
+ "learning_rate": 9.303386375356752e-05,
280
+ "loss": 1.0765,
281
+ "num_input_tokens_seen": 3011316,
282
+ "step": 165
283
+ },
284
+ {
285
+ "epoch": 0.651809249940091,
286
+ "grad_norm": 2.0994279384613037,
287
+ "learning_rate": 9.248457803103476e-05,
288
+ "loss": 1.0743,
289
+ "num_input_tokens_seen": 3113324,
290
+ "step": 170
291
+ },
292
+ {
293
+ "epoch": 0.6709801102324466,
294
+ "grad_norm": 2.0873262882232666,
295
+ "learning_rate": 9.191620176099558e-05,
296
+ "loss": 1.0799,
297
+ "num_input_tokens_seen": 3205472,
298
+ "step": 175
299
+ },
300
+ {
301
+ "epoch": 0.6901509705248023,
302
+ "grad_norm": 1.7372593879699707,
303
+ "learning_rate": 9.132899034469647e-05,
304
+ "loss": 1.095,
305
+ "num_input_tokens_seen": 3303772,
306
+ "step": 180
307
+ },
308
+ {
309
+ "epoch": 0.7093218308171579,
310
+ "grad_norm": 1.7090203762054443,
311
+ "learning_rate": 9.072320764700223e-05,
312
+ "loss": 1.0832,
313
+ "num_input_tokens_seen": 3386080,
314
+ "step": 185
315
+ },
316
+ {
317
+ "epoch": 0.7284926911095135,
318
+ "grad_norm": 1.707291603088379,
319
+ "learning_rate": 9.009912587782771e-05,
320
+ "loss": 1.0376,
321
+ "num_input_tokens_seen": 3471524,
322
+ "step": 190
323
+ },
324
+ {
325
+ "epoch": 0.7476635514018691,
326
+ "grad_norm": 1.663960576057434,
327
+ "learning_rate": 8.945702546981969e-05,
328
+ "loss": 1.0802,
329
+ "num_input_tokens_seen": 3556336,
330
+ "step": 195
331
+ },
332
+ {
333
+ "epoch": 0.7668344116942247,
334
+ "grad_norm": 1.8464930057525635,
335
+ "learning_rate": 8.879719495234363e-05,
336
+ "loss": 1.0769,
337
+ "num_input_tokens_seen": 3640408,
338
+ "step": 200
339
+ },
340
+ {
341
+ "epoch": 0.7668344116942247,
342
+ "eval_loss": 0.862065851688385,
343
+ "eval_runtime": 0.8351,
344
+ "eval_samples_per_second": 179.628,
345
+ "eval_steps_per_second": 45.506,
346
+ "num_input_tokens_seen": 3640408,
347
+ "step": 200
348
+ },
349
+ {
350
+ "epoch": 0.7860052719865804,
351
+ "grad_norm": 1.7220999002456665,
352
+ "learning_rate": 8.811993082183243e-05,
353
+ "loss": 1.078,
354
+ "num_input_tokens_seen": 3731324,
355
+ "step": 205
356
+ },
357
+ {
358
+ "epoch": 0.805176132278936,
359
+ "grad_norm": 1.7034516334533691,
360
+ "learning_rate": 8.742553740855506e-05,
361
+ "loss": 1.0565,
362
+ "num_input_tokens_seen": 3822944,
363
+ "step": 210
364
+ },
365
+ {
366
+ "epoch": 0.8243469925712916,
367
+ "grad_norm": 2.327296257019043,
368
+ "learning_rate": 8.671432673986494e-05,
369
+ "loss": 1.0721,
370
+ "num_input_tokens_seen": 3922476,
371
+ "step": 215
372
+ },
373
+ {
374
+ "epoch": 0.8435178528636472,
375
+ "grad_norm": 1.6464370489120483,
376
+ "learning_rate": 8.598661839998972e-05,
377
+ "loss": 1.0573,
378
+ "num_input_tokens_seen": 4003388,
379
+ "step": 220
380
+ },
381
+ {
382
+ "epoch": 0.8626887131560029,
383
+ "grad_norm": 2.116698741912842,
384
+ "learning_rate": 8.524273938642538e-05,
385
+ "loss": 1.0459,
386
+ "num_input_tokens_seen": 4084052,
387
+ "step": 225
388
+ },
389
+ {
390
+ "epoch": 0.8818595734483585,
391
+ "grad_norm": 1.5513197183609009,
392
+ "learning_rate": 8.448302396299905e-05,
393
+ "loss": 1.073,
394
+ "num_input_tokens_seen": 4177072,
395
+ "step": 230
396
+ },
397
+ {
398
+ "epoch": 0.9010304337407141,
399
+ "grad_norm": 1.8118634223937988,
400
+ "learning_rate": 8.370781350966683e-05,
401
+ "loss": 1.0786,
402
+ "num_input_tokens_seen": 4272004,
403
+ "step": 235
404
+ },
405
+ {
406
+ "epoch": 0.9202012940330697,
407
+ "grad_norm": 1.542823076248169,
408
+ "learning_rate": 8.291745636911382e-05,
409
+ "loss": 1.0556,
410
+ "num_input_tokens_seen": 4367124,
411
+ "step": 240
412
+ },
413
+ {
414
+ "epoch": 0.9393721543254253,
415
+ "grad_norm": 1.5214637517929077,
416
+ "learning_rate": 8.211230769022551e-05,
417
+ "loss": 1.052,
418
+ "num_input_tokens_seen": 4454860,
419
+ "step": 245
420
+ },
421
+ {
422
+ "epoch": 0.958543014617781,
423
+ "grad_norm": 1.5787030458450317,
424
+ "learning_rate": 8.129272926850079e-05,
425
+ "loss": 1.032,
426
+ "num_input_tokens_seen": 4544744,
427
+ "step": 250
428
+ },
429
+ {
430
+ "epoch": 0.9777138749101366,
431
+ "grad_norm": 1.842089295387268,
432
+ "learning_rate": 8.045908938347828e-05,
433
+ "loss": 1.0585,
434
+ "num_input_tokens_seen": 4627284,
435
+ "step": 255
436
+ },
437
+ {
438
+ "epoch": 0.9968847352024922,
439
+ "grad_norm": 1.8476406335830688,
440
+ "learning_rate": 7.961176263324901e-05,
441
+ "loss": 1.0045,
442
+ "num_input_tokens_seen": 4715336,
443
+ "step": 260
444
+ },
445
+ {
446
+ "epoch": 1.0160555954948478,
447
+ "grad_norm": 1.540380835533142,
448
+ "learning_rate": 7.875112976612984e-05,
449
+ "loss": 0.8789,
450
+ "num_input_tokens_seen": 4811720,
451
+ "step": 265
452
+ },
453
+ {
454
+ "epoch": 1.0352264557872035,
455
+ "grad_norm": 1.593477725982666,
456
+ "learning_rate": 7.787757750957334e-05,
457
+ "loss": 0.8548,
458
+ "num_input_tokens_seen": 4904124,
459
+ "step": 270
460
+ },
461
+ {
462
+ "epoch": 1.054397316079559,
463
+ "grad_norm": 2.4607889652252197,
464
+ "learning_rate": 7.699149839639086e-05,
465
+ "loss": 0.8393,
466
+ "num_input_tokens_seen": 4997508,
467
+ "step": 275
468
+ },
469
+ {
470
+ "epoch": 1.0735681763719147,
471
+ "grad_norm": 2.0560553073883057,
472
+ "learning_rate": 7.609329058836695e-05,
473
+ "loss": 0.8517,
474
+ "num_input_tokens_seen": 5102324,
475
+ "step": 280
476
+ },
477
+ {
478
+ "epoch": 1.0927390366642702,
479
+ "grad_norm": 1.5116240978240967,
480
+ "learning_rate": 7.518335769734439e-05,
481
+ "loss": 0.8498,
482
+ "num_input_tokens_seen": 5203192,
483
+ "step": 285
484
+ },
485
+ {
486
+ "epoch": 1.111909896956626,
487
+ "grad_norm": 1.517767071723938,
488
+ "learning_rate": 7.426210860386031e-05,
489
+ "loss": 0.8269,
490
+ "num_input_tokens_seen": 5300184,
491
+ "step": 290
492
+ },
493
+ {
494
+ "epoch": 1.1310807572489816,
495
+ "grad_norm": 1.6254216432571411,
496
+ "learning_rate": 7.332995727341462e-05,
497
+ "loss": 0.8496,
498
+ "num_input_tokens_seen": 5405724,
499
+ "step": 295
500
+ },
501
+ {
502
+ "epoch": 1.150251617541337,
503
+ "grad_norm": 1.6421043872833252,
504
+ "learning_rate": 7.238732257045372e-05,
505
+ "loss": 0.849,
506
+ "num_input_tokens_seen": 5504644,
507
+ "step": 300
508
+ },
509
+ {
510
+ "epoch": 1.150251617541337,
511
+ "eval_loss": 0.8501759767532349,
512
+ "eval_runtime": 0.8525,
513
+ "eval_samples_per_second": 175.946,
+ "eval_steps_per_second": 44.573,
+ "num_input_tokens_seen": 5504644,
+ "step": 300
+ },
+ {
+ "epoch": 1.1694224778336928,
+ "grad_norm": 1.7481452226638794,
+ "learning_rate": 7.143462807015271e-05,
+ "loss": 0.8823,
+ "num_input_tokens_seen": 5597560,
+ "step": 305
+ },
+ {
+ "epoch": 1.1885933381260485,
+ "grad_norm": 1.725755214691162,
+ "learning_rate": 7.047230186808085e-05,
+ "loss": 0.8701,
+ "num_input_tokens_seen": 5691244,
+ "step": 310
+ },
+ {
+ "epoch": 1.207764198418404,
+ "grad_norm": 1.4096667766571045,
+ "learning_rate": 6.950077638783578e-05,
+ "loss": 0.8501,
+ "num_input_tokens_seen": 5773336,
+ "step": 315
+ },
+ {
+ "epoch": 1.2269350587107597,
+ "grad_norm": 1.5906323194503784,
+ "learning_rate": 6.8520488186733e-05,
+ "loss": 0.8684,
+ "num_input_tokens_seen": 5849224,
+ "step": 320
+ },
+ {
+ "epoch": 1.2461059190031152,
+ "grad_norm": 2.2204673290252686,
+ "learning_rate": 6.753187775963773e-05,
+ "loss": 0.8688,
+ "num_input_tokens_seen": 5943104,
+ "step": 325
+ },
+ {
+ "epoch": 1.2652767792954709,
+ "grad_norm": 1.40131413936615,
+ "learning_rate": 6.653538934102743e-05,
+ "loss": 0.8495,
+ "num_input_tokens_seen": 6051232,
+ "step": 330
+ },
+ {
+ "epoch": 1.2844476395878264,
+ "grad_norm": 1.578782558441162,
+ "learning_rate": 6.553147070537413e-05,
+ "loss": 0.8218,
+ "num_input_tokens_seen": 6137520,
+ "step": 335
+ },
+ {
+ "epoch": 1.303618499880182,
+ "grad_norm": 1.5068399906158447,
+ "learning_rate": 6.452057296593568e-05,
+ "loss": 0.8964,
+ "num_input_tokens_seen": 6237012,
+ "step": 340
+ },
+ {
+ "epoch": 1.3227893601725378,
+ "grad_norm": 1.6630327701568604,
+ "learning_rate": 6.350315037204714e-05,
+ "loss": 0.8748,
+ "num_input_tokens_seen": 6332764,
+ "step": 345
+ },
+ {
+ "epoch": 1.3419602204648933,
+ "grad_norm": 1.569263219833374,
+ "learning_rate": 6.247966010500258e-05,
+ "loss": 0.8478,
+ "num_input_tokens_seen": 6416688,
+ "step": 350
+ },
+ {
+ "epoch": 1.361131080757249,
+ "grad_norm": 1.4157441854476929,
+ "learning_rate": 6.145056207261964e-05,
+ "loss": 0.8624,
+ "num_input_tokens_seen": 6507660,
+ "step": 355
+ },
+ {
+ "epoch": 1.3803019410496047,
+ "grad_norm": 1.4510629177093506,
+ "learning_rate": 6.0416318702578826e-05,
+ "loss": 0.851,
+ "num_input_tokens_seen": 6608708,
+ "step": 360
+ },
+ {
+ "epoch": 1.3994728013419602,
+ "grad_norm": 1.660876989364624,
+ "learning_rate": 5.9377394734630464e-05,
+ "loss": 0.8401,
+ "num_input_tokens_seen": 6700852,
+ "step": 365
+ },
+ {
+ "epoch": 1.4186436616343159,
+ "grad_norm": 1.4707056283950806,
+ "learning_rate": 5.833425701176294e-05,
+ "loss": 0.8646,
+ "num_input_tokens_seen": 6792244,
+ "step": 370
+ },
+ {
+ "epoch": 1.4378145219266716,
+ "grad_norm": 1.563291311264038,
+ "learning_rate": 5.728737427042548e-05,
+ "loss": 0.8732,
+ "num_input_tokens_seen": 6875536,
+ "step": 375
+ },
+ {
+ "epoch": 1.456985382219027,
+ "grad_norm": 1.286574125289917,
+ "learning_rate": 5.623721692990039e-05,
+ "loss": 0.8449,
+ "num_input_tokens_seen": 6958384,
+ "step": 380
+ },
+ {
+ "epoch": 1.4761562425113828,
+ "grad_norm": 1.413732886314392,
+ "learning_rate": 5.518425688091906e-05,
+ "loss": 0.8459,
+ "num_input_tokens_seen": 7040740,
+ "step": 385
+ },
+ {
+ "epoch": 1.4953271028037383,
+ "grad_norm": 1.3040345907211304,
+ "learning_rate": 5.4128967273616625e-05,
+ "loss": 0.8539,
+ "num_input_tokens_seen": 7132892,
+ "step": 390
+ },
+ {
+ "epoch": 1.514497963096094,
+ "grad_norm": 1.4091562032699585,
+ "learning_rate": 5.307182230492088e-05,
+ "loss": 0.816,
+ "num_input_tokens_seen": 7217480,
+ "step": 395
+ },
+ {
+ "epoch": 1.5336688233884495,
+ "grad_norm": 1.5895339250564575,
+ "learning_rate": 5.201329700547076e-05,
+ "loss": 0.8612,
+ "num_input_tokens_seen": 7316212,
+ "step": 400
+ },
+ {
+ "epoch": 1.5336688233884495,
+ "eval_loss": 0.8288899064064026,
+ "eval_runtime": 0.8525,
+ "eval_samples_per_second": 175.948,
+ "eval_steps_per_second": 44.574,
+ "num_input_tokens_seen": 7316212,
+ "step": 400
+ },
+ {
+ "epoch": 1.5528396836808052,
+ "grad_norm": 1.523808240890503,
+ "learning_rate": 5.095386702616012e-05,
+ "loss": 0.8411,
+ "num_input_tokens_seen": 7397436,
+ "step": 405
+ },
+ {
+ "epoch": 1.5720105439731609,
+ "grad_norm": 1.4366319179534912,
+ "learning_rate": 4.989400842440289e-05,
+ "loss": 0.8179,
+ "num_input_tokens_seen": 7489804,
+ "step": 410
+ },
+ {
+ "epoch": 1.5911814042655164,
+ "grad_norm": 1.335564136505127,
+ "learning_rate": 4.883419745021554e-05,
+ "loss": 0.8321,
+ "num_input_tokens_seen": 7575200,
+ "step": 415
+ },
+ {
+ "epoch": 1.610352264557872,
+ "grad_norm": 1.5225696563720703,
+ "learning_rate": 4.7774910332213e-05,
+ "loss": 0.8408,
+ "num_input_tokens_seen": 7661620,
+ "step": 420
+ },
+ {
+ "epoch": 1.6295231248502278,
+ "grad_norm": 1.286909580230713,
+ "learning_rate": 4.6716623063614094e-05,
+ "loss": 0.8335,
+ "num_input_tokens_seen": 7751008,
+ "step": 425
+ },
+ {
+ "epoch": 1.6486939851425833,
+ "grad_norm": 1.399095058441162,
+ "learning_rate": 4.565981118835299e-05,
+ "loss": 0.8586,
+ "num_input_tokens_seen": 7847504,
+ "step": 430
+ },
+ {
+ "epoch": 1.6678648454349387,
+ "grad_norm": 1.5877128839492798,
+ "learning_rate": 4.4604949587392234e-05,
+ "loss": 0.8451,
+ "num_input_tokens_seen": 7940004,
+ "step": 435
+ },
+ {
+ "epoch": 1.6870357057272947,
+ "grad_norm": 1.661634087562561,
+ "learning_rate": 4.355251226533396e-05,
+ "loss": 0.8517,
+ "num_input_tokens_seen": 8037312,
+ "step": 440
+ },
+ {
+ "epoch": 1.7062065660196502,
+ "grad_norm": 1.2558486461639404,
+ "learning_rate": 4.250297213742473e-05,
+ "loss": 0.8115,
+ "num_input_tokens_seen": 8134436,
+ "step": 445
+ },
+ {
+ "epoch": 1.7253774263120056,
+ "grad_norm": 1.4006855487823486,
+ "learning_rate": 4.145680081704989e-05,
+ "loss": 0.8471,
+ "num_input_tokens_seen": 8223592,
+ "step": 450
+ },
+ {
+ "epoch": 1.7445482866043613,
+ "grad_norm": 1.5870461463928223,
+ "learning_rate": 4.0414468403813095e-05,
+ "loss": 0.8582,
+ "num_input_tokens_seen": 8312984,
+ "step": 455
+ },
+ {
+ "epoch": 1.763719146896717,
+ "grad_norm": 1.4359045028686523,
+ "learning_rate": 3.937644327229572e-05,
+ "loss": 0.8146,
+ "num_input_tokens_seen": 8402888,
+ "step": 460
+ },
+ {
+ "epoch": 1.7828900071890725,
+ "grad_norm": 1.4591436386108398,
+ "learning_rate": 3.8343191861591795e-05,
+ "loss": 0.8276,
+ "num_input_tokens_seen": 8503552,
+ "step": 465
+ },
+ {
+ "epoch": 1.8020608674814282,
+ "grad_norm": 1.6432543992996216,
+ "learning_rate": 3.7315178465712366e-05,
+ "loss": 0.8524,
+ "num_input_tokens_seen": 8604500,
+ "step": 470
+ },
+ {
+ "epoch": 1.821231727773784,
+ "grad_norm": 1.58722984790802,
+ "learning_rate": 3.629286502495394e-05,
+ "loss": 0.8257,
+ "num_input_tokens_seen": 8690612,
+ "step": 475
+ },
+ {
+ "epoch": 1.8404025880661394,
+ "grad_norm": 1.3321154117584229,
+ "learning_rate": 3.52767109183244e-05,
+ "loss": 0.8194,
+ "num_input_tokens_seen": 8785132,
+ "step": 480
+ },
+ {
+ "epoch": 1.8595734483584951,
+ "grad_norm": 1.8727712631225586,
+ "learning_rate": 3.426717275712e-05,
+ "loss": 0.8414,
+ "num_input_tokens_seen": 8882700,
+ "step": 485
+ },
+ {
+ "epoch": 1.8787443086508508,
+ "grad_norm": 1.427962303161621,
+ "learning_rate": 3.326470417974604e-05,
+ "loss": 0.8275,
+ "num_input_tokens_seen": 8982636,
+ "step": 490
+ },
+ {
+ "epoch": 1.8979151689432063,
+ "grad_norm": 1.391340970993042,
+ "learning_rate": 3.226975564787322e-05,
+ "loss": 0.8115,
+ "num_input_tokens_seen": 9090140,
+ "step": 495
+ },
+ {
+ "epoch": 1.9170860292355618,
+ "grad_norm": 1.465030312538147,
+ "learning_rate": 3.1282774244021715e-05,
+ "loss": 0.7934,
+ "num_input_tokens_seen": 9167936,
+ "step": 500
+ },
+ {
+ "epoch": 1.9170860292355618,
+ "eval_loss": 0.8071622252464294,
+ "eval_runtime": 0.8763,
+ "eval_samples_per_second": 171.177,
+ "eval_steps_per_second": 43.365,
+ "num_input_tokens_seen": 9167936,
+ "step": 500
+ },
+ {
+ "epoch": 1.9362568895279175,
+ "grad_norm": 1.6803951263427734,
+ "learning_rate": 3.0304203470663505e-05,
+ "loss": 0.821,
+ "num_input_tokens_seen": 9252204,
+ "step": 505
+ },
+ {
+ "epoch": 1.9554277498202732,
+ "grad_norm": 1.4982653856277466,
+ "learning_rate": 2.9334483050933503e-05,
+ "loss": 0.7982,
+ "num_input_tokens_seen": 9326444,
+ "step": 510
+ },
+ {
+ "epoch": 1.9745986101126287,
+ "grad_norm": 1.5553169250488281,
+ "learning_rate": 2.8374048731038898e-05,
+ "loss": 0.8183,
+ "num_input_tokens_seen": 9412396,
+ "step": 515
+ },
+ {
+ "epoch": 1.9937694704049844,
+ "grad_norm": 1.176651954650879,
+ "learning_rate": 2.7423332084455544e-05,
+ "loss": 0.7837,
+ "num_input_tokens_seen": 9497412,
+ "step": 520
+ },
+ {
+ "epoch": 2.01294033069734,
+ "grad_norm": 1.2269119024276733,
+ "learning_rate": 2.648276031799934e-05,
+ "loss": 0.726,
+ "num_input_tokens_seen": 9594720,
+ "step": 525
+ },
+ {
+ "epoch": 2.0321111909896956,
+ "grad_norm": 1.4332973957061768,
+ "learning_rate": 2.5552756079859903e-05,
+ "loss": 0.6847,
+ "num_input_tokens_seen": 9675564,
+ "step": 530
+ },
+ {
+ "epoch": 2.051282051282051,
+ "grad_norm": 1.3018600940704346,
+ "learning_rate": 2.4633737269682543e-05,
+ "loss": 0.682,
+ "num_input_tokens_seen": 9772140,
+ "step": 535
+ },
+ {
+ "epoch": 2.070452911574407,
+ "grad_norm": 1.2701021432876587,
+ "learning_rate": 2.3726116850783985e-05,
+ "loss": 0.6643,
+ "num_input_tokens_seen": 9850508,
+ "step": 540
+ },
+ {
+ "epoch": 2.0896237718667625,
+ "grad_norm": 1.2966877222061157,
+ "learning_rate": 2.283030266458644e-05,
+ "loss": 0.6868,
+ "num_input_tokens_seen": 9949592,
+ "step": 545
+ },
+ {
+ "epoch": 2.108794632159118,
+ "grad_norm": 1.4499340057373047,
+ "learning_rate": 2.194669724735296e-05,
+ "loss": 0.6713,
+ "num_input_tokens_seen": 10041244,
+ "step": 550
+ },
+ {
+ "epoch": 2.127965492451474,
+ "grad_norm": 1.791567325592041,
+ "learning_rate": 2.1075697649306835e-05,
+ "loss": 0.6893,
+ "num_input_tokens_seen": 10140004,
+ "step": 555
+ },
+ {
+ "epoch": 2.1471363527438294,
+ "grad_norm": 1.4607737064361572,
+ "learning_rate": 2.0217695256216195e-05,
+ "loss": 0.6784,
+ "num_input_tokens_seen": 10233820,
+ "step": 560
+ },
+ {
+ "epoch": 2.166307213036185,
+ "grad_norm": 1.4586418867111206,
+ "learning_rate": 1.937307561352373e-05,
+ "loss": 0.6829,
+ "num_input_tokens_seen": 10324844,
+ "step": 565
+ },
+ {
+ "epoch": 2.1854780733285404,
+ "grad_norm": 1.4719526767730713,
+ "learning_rate": 1.854221825310103e-05,
+ "loss": 0.6775,
+ "num_input_tokens_seen": 10413592,
+ "step": 570
+ },
+ {
+ "epoch": 2.2046489336208963,
+ "grad_norm": 1.4231044054031372,
+ "learning_rate": 1.7725496522704998e-05,
+ "loss": 0.6872,
+ "num_input_tokens_seen": 10503432,
+ "step": 575
+ },
+ {
+ "epoch": 2.223819793913252,
+ "grad_norm": 1.3985334634780884,
+ "learning_rate": 1.6923277418213117e-05,
+ "loss": 0.65,
+ "num_input_tokens_seen": 10600988,
+ "step": 580
+ },
+ {
+ "epoch": 2.2429906542056073,
+ "grad_norm": 1.3148375749588013,
+ "learning_rate": 1.6135921418712956e-05,
+ "loss": 0.6901,
+ "num_input_tokens_seen": 10692664,
+ "step": 585
+ },
+ {
+ "epoch": 2.262161514497963,
+ "grad_norm": 1.3653095960617065,
+ "learning_rate": 1.536378232452003e-05,
+ "loss": 0.6661,
+ "num_input_tokens_seen": 10781516,
+ "step": 590
+ },
+ {
+ "epoch": 2.2813323747903187,
+ "grad_norm": 1.4331799745559692,
+ "learning_rate": 1.4607207098196852e-05,
+ "loss": 0.669,
+ "num_input_tokens_seen": 10874028,
+ "step": 595
+ },
+ {
+ "epoch": 2.300503235082674,
+ "grad_norm": 1.3715401887893677,
+ "learning_rate": 1.3866535708644334e-05,
+ "loss": 0.6701,
+ "num_input_tokens_seen": 10969348,
+ "step": 600
+ },
+ {
+ "epoch": 2.300503235082674,
+ "eval_loss": 0.8050560355186462,
+ "eval_runtime": 0.8838,
+ "eval_samples_per_second": 169.723,
+ "eval_steps_per_second": 42.997,
+ "num_input_tokens_seen": 10969348,
+ "step": 600
+ },
+ {
+ "epoch": 2.31967409537503,
+ "grad_norm": 1.2875298261642456,
+ "learning_rate": 1.3142100978336069e-05,
+ "loss": 0.6877,
+ "num_input_tokens_seen": 11064696,
+ "step": 605
+ },
+ {
+ "epoch": 2.3388449556673856,
+ "grad_norm": 1.2775810956954956,
+ "learning_rate": 1.2434228433763657e-05,
+ "loss": 0.6752,
+ "num_input_tokens_seen": 11157516,
+ "step": 610
+ },
+ {
+ "epoch": 2.358015815959741,
+ "grad_norm": 1.3270821571350098,
+ "learning_rate": 1.1743236159160653e-05,
+ "loss": 0.6915,
+ "num_input_tokens_seen": 11236376,
+ "step": 615
+ },
+ {
+ "epoch": 2.377186676252097,
+ "grad_norm": 1.2592318058013916,
+ "learning_rate": 1.1069434653570631e-05,
+ "loss": 0.6786,
+ "num_input_tokens_seen": 11343512,
+ "step": 620
+ },
+ {
+ "epoch": 2.3963575365444525,
+ "grad_norm": 1.4185535907745361,
+ "learning_rate": 1.0413126691323666e-05,
+ "loss": 0.6876,
+ "num_input_tokens_seen": 11433156,
+ "step": 625
+ },
+ {
+ "epoch": 2.415528396836808,
+ "grad_norm": 1.5168098211288452,
+ "learning_rate": 9.774607185984002e-06,
+ "loss": 0.6911,
+ "num_input_tokens_seen": 11517328,
+ "step": 630
+ },
+ {
+ "epoch": 2.4346992571291635,
+ "grad_norm": 1.2858244180679321,
+ "learning_rate": 9.154163057829879e-06,
+ "loss": 0.6618,
+ "num_input_tokens_seen": 11602144,
+ "step": 635
+ },
+ {
+ "epoch": 2.4538701174215194,
+ "grad_norm": 1.2297165393829346,
+ "learning_rate": 8.552073104925295e-06,
+ "loss": 0.6804,
+ "num_input_tokens_seen": 11691216,
+ "step": 640
+ },
+ {
+ "epoch": 2.473040977713875,
+ "grad_norm": 1.2545212507247925,
+ "learning_rate": 7.968607877841332e-06,
+ "loss": 0.6669,
+ "num_input_tokens_seen": 11789340,
+ "step": 645
+ },
+ {
+ "epoch": 2.4922118380062304,
+ "grad_norm": 1.3260319232940674,
+ "learning_rate": 7.404029558083653e-06,
+ "loss": 0.6847,
+ "num_input_tokens_seen": 11884208,
+ "step": 650
+ },
+ {
+ "epoch": 2.5113826982985863,
+ "grad_norm": 1.2979117631912231,
+ "learning_rate": 6.858591840280626e-06,
+ "loss": 0.6592,
+ "num_input_tokens_seen": 11972124,
+ "step": 655
+ },
+ {
+ "epoch": 2.5305535585909418,
+ "grad_norm": 1.2984734773635864,
+ "learning_rate": 6.3325398181849845e-06,
+ "loss": 0.6579,
+ "num_input_tokens_seen": 12057320,
+ "step": 660
+ },
+ {
+ "epoch": 2.5497244188832973,
+ "grad_norm": 1.2366037368774414,
+ "learning_rate": 5.826109874540409e-06,
+ "loss": 0.6666,
+ "num_input_tokens_seen": 12154952,
+ "step": 665
+ },
+ {
+ "epoch": 2.5688952791756527,
+ "grad_norm": 1.2756296396255493,
+ "learning_rate": 5.33952957486234e-06,
+ "loss": 0.6903,
+ "num_input_tokens_seen": 12256604,
+ "step": 670
+ },
+ {
+ "epoch": 2.5880661394680087,
+ "grad_norm": 1.3016470670700073,
+ "learning_rate": 4.873017565180871e-06,
+ "loss": 0.6654,
+ "num_input_tokens_seen": 12341988,
+ "step": 675
+ },
+ {
+ "epoch": 2.607236999760364,
+ "grad_norm": 1.3244975805282593,
+ "learning_rate": 4.4267834737916296e-06,
+ "loss": 0.6597,
+ "num_input_tokens_seen": 12442644,
+ "step": 680
+ },
+ {
+ "epoch": 2.62640786005272,
+ "grad_norm": 1.4668656587600708,
+ "learning_rate": 4.001027817058789e-06,
+ "loss": 0.6709,
+ "num_input_tokens_seen": 12538688,
+ "step": 685
+ },
+ {
+ "epoch": 2.6455787203450756,
+ "grad_norm": 1.4666439294815063,
+ "learning_rate": 3.5959419093125946e-06,
+ "loss": 0.6793,
+ "num_input_tokens_seen": 12621632,
+ "step": 690
+ },
+ {
+ "epoch": 2.664749580637431,
+ "grad_norm": 1.2680643796920776,
+ "learning_rate": 3.211707776881739e-06,
+ "loss": 0.6562,
+ "num_input_tokens_seen": 12715832,
+ "step": 695
+ },
+ {
+ "epoch": 2.6839204409297865,
+ "grad_norm": 1.3070735931396484,
+ "learning_rate": 2.848498076299483e-06,
+ "loss": 0.6579,
+ "num_input_tokens_seen": 12814620,
+ "step": 700
+ },
+ {
+ "epoch": 2.6839204409297865,
+ "eval_loss": 0.7903470396995544,
+ "eval_runtime": 0.941,
+ "eval_samples_per_second": 159.398,
+ "eval_steps_per_second": 40.381,
+ "num_input_tokens_seen": 12814620,
+ "step": 700
+ },
+ {
+ "epoch": 2.7030913012221425,
+ "grad_norm": 1.4454238414764404,
+ "learning_rate": 2.506476016719922e-06,
+ "loss": 0.6818,
+ "num_input_tokens_seen": 12909896,
+ "step": 705
+ },
+ {
+ "epoch": 2.722262161514498,
+ "grad_norm": 1.249842882156372,
+ "learning_rate": 2.1857952865796614e-06,
+ "loss": 0.6655,
+ "num_input_tokens_seen": 13001288,
+ "step": 710
+ },
+ {
+ "epoch": 2.7414330218068534,
+ "grad_norm": 1.312296986579895,
+ "learning_rate": 1.8865999845374793e-06,
+ "loss": 0.656,
+ "num_input_tokens_seen": 13094568,
+ "step": 715
+ },
+ {
+ "epoch": 2.7606038820992094,
+ "grad_norm": 1.1614197492599487,
+ "learning_rate": 1.6090245547232707e-06,
+ "loss": 0.6664,
+ "num_input_tokens_seen": 13181872,
+ "step": 720
+ },
+ {
+ "epoch": 2.779774742391565,
+ "grad_norm": 1.249230146408081,
+ "learning_rate": 1.353193726325247e-06,
+ "loss": 0.6643,
+ "num_input_tokens_seen": 13276044,
+ "step": 725
+ },
+ {
+ "epoch": 2.7989456026839203,
+ "grad_norm": 1.3258079290390015,
+ "learning_rate": 1.1192224575425848e-06,
+ "loss": 0.6352,
+ "num_input_tokens_seen": 13361008,
+ "step": 730
+ },
+ {
+ "epoch": 2.818116462976276,
+ "grad_norm": 1.364700198173523,
+ "learning_rate": 9.072158839286748e-07,
+ "loss": 0.6686,
+ "num_input_tokens_seen": 13458560,
+ "step": 735
+ },
+ {
+ "epoch": 2.8372873232686318,
+ "grad_norm": 1.3140686750411987,
+ "learning_rate": 7.172692711482021e-07,
+ "loss": 0.6623,
+ "num_input_tokens_seen": 13551928,
+ "step": 740
+ },
+ {
+ "epoch": 2.8564581835609872,
+ "grad_norm": 1.4460844993591309,
+ "learning_rate": 5.494679721693152e-07,
+ "loss": 0.6658,
+ "num_input_tokens_seen": 13632792,
+ "step": 745
+ },
+ {
+ "epoch": 2.875629043853343,
+ "grad_norm": 1.2111197710037231,
+ "learning_rate": 4.0388738891002366e-07,
+ "loss": 0.6239,
+ "num_input_tokens_seen": 13723172,
+ "step": 750
+ },
+ {
+ "epoch": 2.8947999041456987,
+ "grad_norm": 1.2612218856811523,
+ "learning_rate": 2.8059293835620003e-07,
+ "loss": 0.6682,
+ "num_input_tokens_seen": 13820332,
+ "step": 755
+ },
+ {
+ "epoch": 2.913970764438054,
+ "grad_norm": 1.3601187467575073,
+ "learning_rate": 1.7964002316628315e-07,
+ "loss": 0.6729,
+ "num_input_tokens_seen": 13907996,
+ "step": 760
+ },
+ {
+ "epoch": 2.9331416247304096,
+ "grad_norm": 1.3947527408599854,
+ "learning_rate": 1.0107400677596412e-07,
+ "loss": 0.6736,
+ "num_input_tokens_seen": 13993196,
+ "step": 765
+ },
+ {
+ "epoch": 2.9523124850227656,
+ "grad_norm": 1.2355984449386597,
+ "learning_rate": 4.493019301401446e-08,
+ "loss": 0.6655,
+ "num_input_tokens_seen": 14084984,
+ "step": 770
+ },
+ {
+ "epoch": 2.971483345315121,
+ "grad_norm": 1.2893033027648926,
+ "learning_rate": 1.1233810238425735e-08,
+ "loss": 0.657,
+ "num_input_tokens_seen": 14169848,
+ "step": 775
+ },
+ {
+ "epoch": 2.9906542056074765,
+ "grad_norm": 1.3213859796524048,
+ "learning_rate": 0.0,
+ "loss": 0.6589,
+ "num_input_tokens_seen": 14258488,
+ "step": 780
+ },
+ {
+ "epoch": 2.9906542056074765,
+ "num_input_tokens_seen": 14258488,
+ "step": 780,
+ "total_flos": 3.0175424769490944e+16,
+ "train_loss": 0.895275863011678,
+ "train_runtime": 860.318,
+ "train_samples_per_second": 58.206,
+ "train_steps_per_second": 0.907
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 780,
+ "num_input_tokens_seen": 14258488,
+ "num_train_epochs": 3,
+ "save_steps": 100,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 3.0175424769490944e+16,
+ "train_batch_size": 4,
+ "trial_name": null,
+ "trial_params": null
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4d85165a318ec430e48d0c7cff3da74e1488acdd1e9816f78db5817ca9366a4b
+ size 5496
vocab.json ADDED
The diff for this file is too large to render. See raw diff