rbelanec committed
Commit 6061ce0 · verified · 1 Parent(s): d9ead3d

End of training

README.md CHANGED
@@ -17,7 +17,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # train_copa_456_1760637759
 
-This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on an unknown dataset.
+This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on the copa dataset.
 It achieves the following results on the evaluation set:
 - Loss: 1.0934
 - Num Input Tokens Seen: 501440
all_results.json ADDED
@@ -0,0 +1,13 @@
+{
+    "epoch": 20.0,
+    "eval_loss": 1.093362808227539,
+    "eval_runtime": 0.9825,
+    "eval_samples_per_second": 81.421,
+    "eval_steps_per_second": 20.355,
+    "num_input_tokens_seen": 501440,
+    "total_flos": 2.257961656516608e+16,
+    "train_loss": 0.2722132059369324,
+    "train_runtime": 168.4339,
+    "train_samples_per_second": 37.997,
+    "train_steps_per_second": 9.499
+}
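As a sanity check, the aggregates in all_results.json are mutually consistent: multiplying the reported training runtime by the steps-per-second rate recovers the total number of optimizer steps, and the token count divides across those steps. A minimal sketch, using the values copied from the file above:

```python
# Consistency check on the aggregate metrics in all_results.json.
train_runtime = 168.4339          # seconds
train_steps_per_second = 9.499
num_input_tokens_seen = 501440

total_steps = round(train_runtime * train_steps_per_second)
print(total_steps)  # 1600 optimizer steps (20 epochs of 80 steps each)

avg_tokens_per_step = round(num_input_tokens_seen / total_steps)
print(avg_tokens_per_step)  # 313
```

The 1600-step total matches the `global_step` recorded in trainer_state.json below.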
eval_results.json ADDED
@@ -0,0 +1,8 @@
+{
+    "epoch": 20.0,
+    "eval_loss": 1.093362808227539,
+    "eval_runtime": 0.9825,
+    "eval_samples_per_second": 81.421,
+    "eval_steps_per_second": 20.355,
+    "num_input_tokens_seen": 501440
+}
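The evaluation-side numbers are similarly self-consistent: runtime times samples-per-second gives the size of the evaluation set, and the ratio of samples-per-second to steps-per-second gives the per-step evaluation batch size. A quick sketch with the values from eval_results.json:

```python
# Consistency check on eval_results.json.
eval_runtime = 0.9825             # seconds
eval_samples_per_second = 81.421
eval_steps_per_second = 20.355

print(round(eval_runtime * eval_samples_per_second))  # 80 evaluation examples
print(round(eval_samples_per_second / eval_steps_per_second))  # eval batch size of 4
```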
train_results.json ADDED
@@ -0,0 +1,9 @@
+{
+    "epoch": 20.0,
+    "num_input_tokens_seen": 501440,
+    "total_flos": 2.257961656516608e+16,
+    "train_loss": 0.2722132059369324,
+    "train_runtime": 168.4339,
+    "train_samples_per_second": 37.997,
+    "train_steps_per_second": 9.499
+}
trainer_state.json ADDED
@@ -0,0 +1,2694 @@
+{
+  "best_global_step": 320,
+  "best_metric": 0.23625226318836212,
+  "best_model_checkpoint": "saves_multiple/prefix-tuning/llama-3-8b-instruct/train_copa_456_1760637759/checkpoint-320",
+  "epoch": 20.0,
+  "eval_steps": 160,
+  "global_step": 1600,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.0625,
+      "grad_norm": 200.03012084960938,
+      "learning_rate": 2.5000000000000004e-07,
+      "loss": 9.2806,
+      "num_input_tokens_seen": 1632,
+      "step": 5
+    },
+    {
+      "epoch": 0.125,
+      "grad_norm": 214.69956970214844,
+      "learning_rate": 5.625e-07,
+      "loss": 8.7836,
+      "num_input_tokens_seen": 3136,
+      "step": 10
+    },
+    {
+      "epoch": 0.1875,
+      "grad_norm": 193.6563720703125,
+      "learning_rate": 8.75e-07,
+      "loss": 7.8812,
+      "num_input_tokens_seen": 4736,
+      "step": 15
+    },
+    {
+      "epoch": 0.25,
+      "grad_norm": 165.62625122070312,
+      "learning_rate": 1.1875e-06,
+      "loss": 6.7373,
+      "num_input_tokens_seen": 6304,
+      "step": 20
+    },
+    {
+      "epoch": 0.3125,
+      "grad_norm": 128.19512939453125,
+      "learning_rate": 1.5e-06,
+      "loss": 5.2865,
+      "num_input_tokens_seen": 7904,
+      "step": 25
+    },
+    {
+      "epoch": 0.375,
+      "grad_norm": 103.28244018554688,
+      "learning_rate": 1.8125e-06,
+      "loss": 3.977,
+      "num_input_tokens_seen": 9472,
+      "step": 30
+    },
+    {
+      "epoch": 0.4375,
+      "grad_norm": 80.52030944824219,
+      "learning_rate": 2.125e-06,
+      "loss": 2.8338,
+      "num_input_tokens_seen": 11008,
+      "step": 35
+    },
+    {
+      "epoch": 0.5,
+      "grad_norm": 58.259361267089844,
+      "learning_rate": 2.4375e-06,
+      "loss": 1.5963,
+      "num_input_tokens_seen": 12512,
+      "step": 40
+    },
+    {
+      "epoch": 0.5625,
+      "grad_norm": 41.13335418701172,
+      "learning_rate": 2.7500000000000004e-06,
+      "loss": 0.8379,
+      "num_input_tokens_seen": 14016,
+      "step": 45
+    },
+    {
+      "epoch": 0.625,
+      "grad_norm": 44.33089828491211,
+      "learning_rate": 3.0625000000000003e-06,
+      "loss": 0.4101,
+      "num_input_tokens_seen": 15584,
+      "step": 50
+    },
+    {
+      "epoch": 0.6875,
+      "grad_norm": 89.49345397949219,
+      "learning_rate": 3.3750000000000003e-06,
+      "loss": 0.3284,
+      "num_input_tokens_seen": 17152,
+      "step": 55
+    },
+    {
+      "epoch": 0.75,
+      "grad_norm": 22.057518005371094,
+      "learning_rate": 3.6875000000000007e-06,
+      "loss": 0.3562,
+      "num_input_tokens_seen": 18752,
+      "step": 60
+    },
+    {
+      "epoch": 0.8125,
+      "grad_norm": 15.762816429138184,
+      "learning_rate": 4.000000000000001e-06,
+      "loss": 0.2688,
+      "num_input_tokens_seen": 20320,
+      "step": 65
+    },
+    {
+      "epoch": 0.875,
+      "grad_norm": 25.10136604309082,
+      "learning_rate": 4.312500000000001e-06,
+      "loss": 0.2826,
+      "num_input_tokens_seen": 21888,
+      "step": 70
+    },
+    {
+      "epoch": 0.9375,
+      "grad_norm": 22.71470832824707,
+      "learning_rate": 4.625000000000001e-06,
+      "loss": 0.2588,
+      "num_input_tokens_seen": 23424,
+      "step": 75
+    },
+    {
+      "epoch": 1.0,
+      "grad_norm": 8.763684272766113,
+      "learning_rate": 4.937500000000001e-06,
+      "loss": 0.2796,
+      "num_input_tokens_seen": 24960,
+      "step": 80
+    },
+    {
+      "epoch": 1.0625,
+      "grad_norm": 24.20291519165039,
+      "learning_rate": 5.2500000000000006e-06,
+      "loss": 0.2615,
+      "num_input_tokens_seen": 26464,
+      "step": 85
+    },
+    {
+      "epoch": 1.125,
+      "grad_norm": 5.363375663757324,
+      "learning_rate": 5.5625000000000005e-06,
+      "loss": 0.263,
+      "num_input_tokens_seen": 28064,
+      "step": 90
+    },
+    {
+      "epoch": 1.1875,
+      "grad_norm": 34.018043518066406,
+      "learning_rate": 5.8750000000000005e-06,
+      "loss": 0.2422,
+      "num_input_tokens_seen": 29664,
+      "step": 95
+    },
+    {
+      "epoch": 1.25,
+      "grad_norm": 35.1256217956543,
+      "learning_rate": 6.1875000000000005e-06,
+      "loss": 0.2233,
+      "num_input_tokens_seen": 31264,
+      "step": 100
+    },
+    {
+      "epoch": 1.3125,
+      "grad_norm": 7.767378807067871,
+      "learning_rate": 6.5000000000000004e-06,
+      "loss": 0.2181,
+      "num_input_tokens_seen": 32864,
+      "step": 105
+    },
+    {
+      "epoch": 1.375,
+      "grad_norm": 14.077888488769531,
+      "learning_rate": 6.8125e-06,
+      "loss": 0.3463,
+      "num_input_tokens_seen": 34432,
+      "step": 110
+    },
+    {
+      "epoch": 1.4375,
+      "grad_norm": 10.013070106506348,
+      "learning_rate": 7.125e-06,
+      "loss": 0.2563,
+      "num_input_tokens_seen": 35968,
+      "step": 115
+    },
+    {
+      "epoch": 1.5,
+      "grad_norm": 7.696524143218994,
+      "learning_rate": 7.437500000000001e-06,
+      "loss": 0.2045,
+      "num_input_tokens_seen": 37536,
+      "step": 120
+    },
+    {
+      "epoch": 1.5625,
+      "grad_norm": 12.979708671569824,
+      "learning_rate": 7.75e-06,
+      "loss": 0.432,
+      "num_input_tokens_seen": 39136,
+      "step": 125
+    },
+    {
+      "epoch": 1.625,
+      "grad_norm": 19.278409957885742,
+      "learning_rate": 8.062500000000001e-06,
+      "loss": 0.2754,
+      "num_input_tokens_seen": 40736,
+      "step": 130
+    },
+    {
+      "epoch": 1.6875,
+      "grad_norm": 7.332155704498291,
+      "learning_rate": 8.375e-06,
+      "loss": 0.2547,
+      "num_input_tokens_seen": 42240,
+      "step": 135
+    },
+    {
+      "epoch": 1.75,
+      "grad_norm": 4.9524245262146,
+      "learning_rate": 8.687500000000001e-06,
+      "loss": 0.2304,
+      "num_input_tokens_seen": 43808,
+      "step": 140
+    },
+    {
+      "epoch": 1.8125,
+      "grad_norm": 7.520391941070557,
+      "learning_rate": 9e-06,
+      "loss": 0.218,
+      "num_input_tokens_seen": 45344,
+      "step": 145
+    },
+    {
+      "epoch": 1.875,
+      "grad_norm": 15.684959411621094,
+      "learning_rate": 9.312500000000001e-06,
+      "loss": 0.2485,
+      "num_input_tokens_seen": 46912,
+      "step": 150
+    },
+    {
+      "epoch": 1.9375,
+      "grad_norm": 3.4786770343780518,
+      "learning_rate": 9.625e-06,
+      "loss": 0.2414,
+      "num_input_tokens_seen": 48480,
+      "step": 155
+    },
+    {
+      "epoch": 2.0,
+      "grad_norm": 2.7771852016448975,
+      "learning_rate": 9.937500000000001e-06,
+      "loss": 0.2344,
+      "num_input_tokens_seen": 50080,
+      "step": 160
+    },
+    {
+      "epoch": 2.0,
+      "eval_loss": 0.23884077370166779,
+      "eval_runtime": 0.9157,
+      "eval_samples_per_second": 87.369,
+      "eval_steps_per_second": 21.842,
+      "num_input_tokens_seen": 50080,
+      "step": 160
+    },
+    {
+      "epoch": 2.0625,
+      "grad_norm": 4.3142242431640625,
+      "learning_rate": 9.999809615320857e-06,
+      "loss": 0.2257,
+      "num_input_tokens_seen": 51616,
+      "step": 165
+    },
+    {
+      "epoch": 2.125,
+      "grad_norm": 3.4161956310272217,
+      "learning_rate": 9.999036202410324e-06,
+      "loss": 0.1899,
+      "num_input_tokens_seen": 53216,
+      "step": 170
+    },
+    {
+      "epoch": 2.1875,
+      "grad_norm": 10.080194473266602,
+      "learning_rate": 9.997667954183566e-06,
+      "loss": 0.3096,
+      "num_input_tokens_seen": 54784,
+      "step": 175
+    },
+    {
+      "epoch": 2.25,
+      "grad_norm": 3.2639670372009277,
+      "learning_rate": 9.995705033448435e-06,
+      "loss": 0.2305,
+      "num_input_tokens_seen": 56384,
+      "step": 180
+    },
+    {
+      "epoch": 2.3125,
+      "grad_norm": 9.221940040588379,
+      "learning_rate": 9.993147673772869e-06,
+      "loss": 0.2628,
+      "num_input_tokens_seen": 57920,
+      "step": 185
+    },
+    {
+      "epoch": 2.375,
+      "grad_norm": 3.5883162021636963,
+      "learning_rate": 9.9899961794571e-06,
+      "loss": 0.2506,
+      "num_input_tokens_seen": 59520,
+      "step": 190
+    },
+    {
+      "epoch": 2.4375,
+      "grad_norm": 14.086549758911133,
+      "learning_rate": 9.986250925497429e-06,
+      "loss": 0.2822,
+      "num_input_tokens_seen": 61088,
+      "step": 195
+    },
+    {
+      "epoch": 2.5,
+      "grad_norm": 7.669594764709473,
+      "learning_rate": 9.981912357541628e-06,
+      "loss": 0.2747,
+      "num_input_tokens_seen": 62656,
+      "step": 200
+    },
+    {
+      "epoch": 2.5625,
+      "grad_norm": 2.2965734004974365,
+      "learning_rate": 9.976980991835896e-06,
+      "loss": 0.2442,
+      "num_input_tokens_seen": 64224,
+      "step": 205
+    },
+    {
+      "epoch": 2.625,
+      "grad_norm": 3.275341033935547,
+      "learning_rate": 9.971457415163435e-06,
+      "loss": 0.2616,
+      "num_input_tokens_seen": 65760,
+      "step": 210
+    },
+    {
+      "epoch": 2.6875,
+      "grad_norm": 2.5465989112854004,
+      "learning_rate": 9.965342284774633e-06,
+      "loss": 0.2425,
+      "num_input_tokens_seen": 67296,
+      "step": 215
+    },
+    {
+      "epoch": 2.75,
+      "grad_norm": 4.648524284362793,
+      "learning_rate": 9.958636328308852e-06,
+      "loss": 0.2764,
+      "num_input_tokens_seen": 68832,
+      "step": 220
+    },
+    {
+      "epoch": 2.8125,
+      "grad_norm": 10.161576271057129,
+      "learning_rate": 9.951340343707852e-06,
+      "loss": 0.255,
+      "num_input_tokens_seen": 70400,
+      "step": 225
+    },
+    {
+      "epoch": 2.875,
+      "grad_norm": 5.629575252532959,
+      "learning_rate": 9.943455199120836e-06,
+      "loss": 0.2356,
+      "num_input_tokens_seen": 72000,
+      "step": 230
+    },
+    {
+      "epoch": 2.9375,
+      "grad_norm": 4.087530612945557,
+      "learning_rate": 9.934981832801161e-06,
+      "loss": 0.2339,
+      "num_input_tokens_seen": 73536,
+      "step": 235
+    },
+    {
+      "epoch": 3.0,
+      "grad_norm": 1.1364092826843262,
+      "learning_rate": 9.925921252994677e-06,
+      "loss": 0.232,
+      "num_input_tokens_seen": 75104,
+      "step": 240
+    },
+    {
+      "epoch": 3.0625,
+      "grad_norm": 1.9829556941986084,
+      "learning_rate": 9.916274537819774e-06,
+      "loss": 0.2282,
+      "num_input_tokens_seen": 76640,
+      "step": 245
+    },
+    {
+      "epoch": 3.125,
+      "grad_norm": 1.830893635749817,
+      "learning_rate": 9.90604283513909e-06,
+      "loss": 0.2614,
+      "num_input_tokens_seen": 78208,
+      "step": 250
+    },
+    {
+      "epoch": 3.1875,
+      "grad_norm": 2.716417074203491,
+      "learning_rate": 9.89522736242292e-06,
+      "loss": 0.2393,
+      "num_input_tokens_seen": 79808,
+      "step": 255
+    },
+    {
+      "epoch": 3.25,
+      "grad_norm": 3.134875774383545,
+      "learning_rate": 9.883829406604363e-06,
+      "loss": 0.2281,
+      "num_input_tokens_seen": 81376,
+      "step": 260
+    },
+    {
+      "epoch": 3.3125,
+      "grad_norm": 12.125977516174316,
+      "learning_rate": 9.871850323926178e-06,
+      "loss": 0.2635,
+      "num_input_tokens_seen": 82912,
+      "step": 265
+    },
+    {
+      "epoch": 3.375,
+      "grad_norm": 1.7760015726089478,
+      "learning_rate": 9.859291539779407e-06,
+      "loss": 0.2369,
+      "num_input_tokens_seen": 84512,
+      "step": 270
+    },
+    {
+      "epoch": 3.4375,
+      "grad_norm": 1.7993685007095337,
+      "learning_rate": 9.846154548533773e-06,
+      "loss": 0.2255,
+      "num_input_tokens_seen": 86112,
+      "step": 275
+    },
+    {
+      "epoch": 3.5,
+      "grad_norm": 7.983632564544678,
+      "learning_rate": 9.83244091335986e-06,
+      "loss": 0.2446,
+      "num_input_tokens_seen": 87648,
+      "step": 280
+    },
+    {
+      "epoch": 3.5625,
+      "grad_norm": 4.111968517303467,
+      "learning_rate": 9.818152266043115e-06,
+      "loss": 0.2316,
+      "num_input_tokens_seen": 89280,
+      "step": 285
+    },
+    {
+      "epoch": 3.625,
+      "grad_norm": 4.560153007507324,
+      "learning_rate": 9.803290306789676e-06,
+      "loss": 0.2343,
+      "num_input_tokens_seen": 90816,
+      "step": 290
+    },
+    {
+      "epoch": 3.6875,
+      "grad_norm": 5.503631114959717,
+      "learning_rate": 9.787856804024073e-06,
+      "loss": 0.2224,
+      "num_input_tokens_seen": 92352,
+      "step": 295
+    },
+    {
+      "epoch": 3.75,
+      "grad_norm": 5.467092514038086,
+      "learning_rate": 9.771853594178791e-06,
+      "loss": 0.2373,
+      "num_input_tokens_seen": 93888,
+      "step": 300
+    },
+    {
+      "epoch": 3.8125,
+      "grad_norm": 4.816873550415039,
+      "learning_rate": 9.755282581475769e-06,
+      "loss": 0.2378,
+      "num_input_tokens_seen": 95424,
+      "step": 305
+    },
+    {
+      "epoch": 3.875,
+      "grad_norm": 0.8352380990982056,
+      "learning_rate": 9.7381457376998e-06,
+      "loss": 0.2402,
+      "num_input_tokens_seen": 96960,
+      "step": 310
+    },
+    {
+      "epoch": 3.9375,
+      "grad_norm": 1.431336760520935,
+      "learning_rate": 9.720445101963923e-06,
+      "loss": 0.2336,
+      "num_input_tokens_seen": 98528,
+      "step": 315
+    },
+    {
+      "epoch": 4.0,
+      "grad_norm": 0.8496450185775757,
+      "learning_rate": 9.702182780466775e-06,
+      "loss": 0.2424,
+      "num_input_tokens_seen": 100096,
+      "step": 320
+    },
+    {
+      "epoch": 4.0,
+      "eval_loss": 0.23625226318836212,
+      "eval_runtime": 0.9166,
+      "eval_samples_per_second": 87.277,
+      "eval_steps_per_second": 21.819,
+      "num_input_tokens_seen": 100096,
+      "step": 320
+    },
+    {
+      "epoch": 4.0625,
+      "grad_norm": 0.5525237321853638,
+      "learning_rate": 9.683360946241988e-06,
+      "loss": 0.2293,
+      "num_input_tokens_seen": 101664,
+      "step": 325
+    },
+    {
+      "epoch": 4.125,
+      "grad_norm": 1.4087265729904175,
+      "learning_rate": 9.663981838899612e-06,
+      "loss": 0.2239,
+      "num_input_tokens_seen": 103232,
+      "step": 330
+    },
+    {
+      "epoch": 4.1875,
+      "grad_norm": 0.8357235193252563,
+      "learning_rate": 9.644047764359623e-06,
+      "loss": 0.2219,
+      "num_input_tokens_seen": 104864,
+      "step": 335
+    },
+    {
+      "epoch": 4.25,
+      "grad_norm": 1.3893150091171265,
+      "learning_rate": 9.623561094577541e-06,
+      "loss": 0.2178,
+      "num_input_tokens_seen": 106496,
+      "step": 340
+    },
+    {
+      "epoch": 4.3125,
+      "grad_norm": 1.327685832977295,
+      "learning_rate": 9.602524267262202e-06,
+      "loss": 0.2373,
+      "num_input_tokens_seen": 108064,
+      "step": 345
+    },
+    {
+      "epoch": 4.375,
+      "grad_norm": 2.736804723739624,
+      "learning_rate": 9.58093978558568e-06,
+      "loss": 0.245,
+      "num_input_tokens_seen": 109600,
+      "step": 350
+    },
+    {
+      "epoch": 4.4375,
+      "grad_norm": 1.7261260747909546,
+      "learning_rate": 9.558810217885444e-06,
+      "loss": 0.2297,
+      "num_input_tokens_seen": 111200,
+      "step": 355
+    },
+    {
+      "epoch": 4.5,
+      "grad_norm": 1.1373929977416992,
+      "learning_rate": 9.536138197358747e-06,
+      "loss": 0.2371,
+      "num_input_tokens_seen": 112768,
+      "step": 360
+    },
+    {
+      "epoch": 4.5625,
+      "grad_norm": 0.8946912288665771,
+      "learning_rate": 9.512926421749305e-06,
+      "loss": 0.2346,
+      "num_input_tokens_seen": 114336,
+      "step": 365
+    },
+    {
+      "epoch": 4.625,
+      "grad_norm": 2.0754528045654297,
+      "learning_rate": 9.48917765302629e-06,
+      "loss": 0.2305,
+      "num_input_tokens_seen": 115872,
+      "step": 370
+    },
+    {
+      "epoch": 4.6875,
+      "grad_norm": 0.8747203350067139,
+      "learning_rate": 9.464894717055686e-06,
+      "loss": 0.2299,
+      "num_input_tokens_seen": 117440,
+      "step": 375
+    },
+    {
+      "epoch": 4.75,
+      "grad_norm": 1.415370225906372,
+      "learning_rate": 9.440080503264038e-06,
+      "loss": 0.2204,
+      "num_input_tokens_seen": 118976,
+      "step": 380
+    },
+    {
+      "epoch": 4.8125,
+      "grad_norm": 0.9907402396202087,
+      "learning_rate": 9.414737964294636e-06,
+      "loss": 0.2366,
+      "num_input_tokens_seen": 120544,
+      "step": 385
+    },
+    {
+      "epoch": 4.875,
+      "grad_norm": 2.0482561588287354,
+      "learning_rate": 9.388870115656185e-06,
+      "loss": 0.2541,
+      "num_input_tokens_seen": 122144,
+      "step": 390
+    },
+    {
+      "epoch": 4.9375,
+      "grad_norm": 1.3031115531921387,
+      "learning_rate": 9.362480035363987e-06,
+      "loss": 0.2257,
+      "num_input_tokens_seen": 123680,
+      "step": 395
+    },
+    {
+      "epoch": 5.0,
+      "grad_norm": 2.6485493183135986,
+      "learning_rate": 9.335570863573687e-06,
+      "loss": 0.235,
+      "num_input_tokens_seen": 125248,
+      "step": 400
+    },
+    {
+      "epoch": 5.0625,
+      "grad_norm": 2.2649688720703125,
+      "learning_rate": 9.30814580220763e-06,
+      "loss": 0.2452,
+      "num_input_tokens_seen": 126848,
+      "step": 405
+    },
+    {
+      "epoch": 5.125,
+      "grad_norm": 1.3141331672668457,
+      "learning_rate": 9.280208114573859e-06,
+      "loss": 0.2276,
+      "num_input_tokens_seen": 128448,
+      "step": 410
+    },
+    {
+      "epoch": 5.1875,
+      "grad_norm": 3.1463115215301514,
+      "learning_rate": 9.251761124977816e-06,
+      "loss": 0.2391,
+      "num_input_tokens_seen": 130048,
+      "step": 415
+    },
+    {
+      "epoch": 5.25,
+      "grad_norm": 1.473178744316101,
+      "learning_rate": 9.222808218326784e-06,
+      "loss": 0.2134,
+      "num_input_tokens_seen": 131616,
+      "step": 420
+    },
+    {
+      "epoch": 5.3125,
+      "grad_norm": 3.250842332839966,
+      "learning_rate": 9.193352839727122e-06,
+      "loss": 0.2114,
+      "num_input_tokens_seen": 133120,
+      "step": 425
+    },
+    {
+      "epoch": 5.375,
+      "grad_norm": 6.812272071838379,
+      "learning_rate": 9.163398494074314e-06,
+      "loss": 0.2555,
+      "num_input_tokens_seen": 134720,
+      "step": 430
+    },
+    {
+      "epoch": 5.4375,
+      "grad_norm": 5.645451068878174,
+      "learning_rate": 9.132948745635943e-06,
+      "loss": 0.2165,
+      "num_input_tokens_seen": 136288,
+      "step": 435
+    },
+    {
+      "epoch": 5.5,
+      "grad_norm": 2.791454315185547,
+      "learning_rate": 9.102007217627568e-06,
+      "loss": 0.2247,
+      "num_input_tokens_seen": 137920,
+      "step": 440
+    },
+    {
+      "epoch": 5.5625,
+      "grad_norm": 3.0683376789093018,
+      "learning_rate": 9.070577591781598e-06,
+      "loss": 0.2556,
+      "num_input_tokens_seen": 139488,
+      "step": 445
+    },
+    {
+      "epoch": 5.625,
+      "grad_norm": 1.0806782245635986,
+      "learning_rate": 9.038663607909198e-06,
+      "loss": 0.2418,
+      "num_input_tokens_seen": 141056,
+      "step": 450
+    },
+    {
+      "epoch": 5.6875,
+      "grad_norm": 3.1237070560455322,
+      "learning_rate": 9.006269063455305e-06,
+      "loss": 0.2406,
+      "num_input_tokens_seen": 142624,
+      "step": 455
+    },
+    {
+      "epoch": 5.75,
+      "grad_norm": 1.0548875331878662,
+      "learning_rate": 8.97339781304675e-06,
+      "loss": 0.2277,
+      "num_input_tokens_seen": 144128,
+      "step": 460
+    },
+    {
+      "epoch": 5.8125,
+      "grad_norm": 1.0157082080841064,
+      "learning_rate": 8.94005376803361e-06,
+      "loss": 0.2334,
+      "num_input_tokens_seen": 145696,
+      "step": 465
+    },
+    {
+      "epoch": 5.875,
+      "grad_norm": 0.5916211605072021,
+      "learning_rate": 8.906240896023794e-06,
+      "loss": 0.2352,
+      "num_input_tokens_seen": 147264,
+      "step": 470
+    },
+    {
+      "epoch": 5.9375,
+      "grad_norm": 1.234691858291626,
+      "learning_rate": 8.871963220410929e-06,
+      "loss": 0.2199,
+      "num_input_tokens_seen": 148864,
+      "step": 475
+    },
+    {
+      "epoch": 6.0,
+      "grad_norm": 1.6203759908676147,
+      "learning_rate": 8.837224819895627e-06,
+      "loss": 0.226,
+      "num_input_tokens_seen": 150400,
+      "step": 480
+    },
+    {
+      "epoch": 6.0,
+      "eval_loss": 0.24168923497200012,
+      "eval_runtime": 0.9165,
+      "eval_samples_per_second": 87.29,
+      "eval_steps_per_second": 21.823,
+      "num_input_tokens_seen": 150400,
+      "step": 480
+    },
+    {
+      "epoch": 6.0625,
+      "grad_norm": 2.004993438720703,
+      "learning_rate": 8.802029828000157e-06,
+      "loss": 0.2223,
+      "num_input_tokens_seen": 152000,
+      "step": 485
+    },
+    {
+      "epoch": 6.125,
+      "grad_norm": 3.092801332473755,
+      "learning_rate": 8.766382432576589e-06,
+      "loss": 0.237,
+      "num_input_tokens_seen": 153568,
+      "step": 490
+    },
+    {
+      "epoch": 6.1875,
+      "grad_norm": 1.6768417358398438,
+      "learning_rate": 8.730286875308498e-06,
+      "loss": 0.2362,
+      "num_input_tokens_seen": 155168,
+      "step": 495
+    },
+    {
+      "epoch": 6.25,
+      "grad_norm": 2.1835248470306396,
+      "learning_rate": 8.693747451206231e-06,
+      "loss": 0.2231,
+      "num_input_tokens_seen": 156704,
+      "step": 500
+    },
+    {
+      "epoch": 6.3125,
+      "grad_norm": 2.2564566135406494,
+      "learning_rate": 8.656768508095853e-06,
+      "loss": 0.2062,
+      "num_input_tokens_seen": 158176,
+      "step": 505
+    },
+    {
+      "epoch": 6.375,
+      "grad_norm": 1.0601657629013062,
+      "learning_rate": 8.61935444610179e-06,
+      "loss": 0.2769,
+      "num_input_tokens_seen": 159776,
+      "step": 510
+    },
+    {
+      "epoch": 6.4375,
+      "grad_norm": 0.9735612869262695,
+      "learning_rate": 8.581509717123272e-06,
+      "loss": 0.2179,
+      "num_input_tokens_seen": 161280,
+      "step": 515
+    },
+    {
+      "epoch": 6.5,
+      "grad_norm": 1.7658456563949585,
+      "learning_rate": 8.543238824304585e-06,
+      "loss": 0.2378,
+      "num_input_tokens_seen": 162880,
+      "step": 520
+    },
+    {
+      "epoch": 6.5625,
+      "grad_norm": 0.9902642369270325,
+      "learning_rate": 8.504546321499255e-06,
+      "loss": 0.2241,
+      "num_input_tokens_seen": 164448,
+      "step": 525
+    },
+    {
+      "epoch": 6.625,
+      "grad_norm": 2.589862585067749,
+      "learning_rate": 8.465436812728181e-06,
+      "loss": 0.2337,
+      "num_input_tokens_seen": 166048,
+      "step": 530
+    },
+    {
+      "epoch": 6.6875,
+      "grad_norm": 1.0508556365966797,
+      "learning_rate": 8.425914951631796e-06,
+      "loss": 0.2182,
+      "num_input_tokens_seen": 167616,
+      "step": 535
+    },
+    {
+      "epoch": 6.75,
+      "grad_norm": 1.600427269935608,
+      "learning_rate": 8.385985440916344e-06,
+      "loss": 0.2315,
+      "num_input_tokens_seen": 169184,
+      "step": 540
+    },
+    {
+      "epoch": 6.8125,
+      "grad_norm": 3.7025911808013916,
+      "learning_rate": 8.345653031794292e-06,
+      "loss": 0.2306,
+      "num_input_tokens_seen": 170816,
+      "step": 545
+    },
+    {
+      "epoch": 6.875,
+      "grad_norm": 1.6951884031295776,
+      "learning_rate": 8.304922523418988e-06,
+      "loss": 0.2025,
+      "num_input_tokens_seen": 172416,
+      "step": 550
+    },
+    {
+      "epoch": 6.9375,
+      "grad_norm": 2.9659578800201416,
+      "learning_rate": 8.263798762313613e-06,
+      "loss": 0.211,
+      "num_input_tokens_seen": 173952,
+      "step": 555
+    },
+    {
+      "epoch": 7.0,
+      "grad_norm": 4.885127067565918,
+      "learning_rate": 8.222286641794488e-06,
+      "loss": 0.2373,
+      "num_input_tokens_seen": 175520,
+      "step": 560
+    },
+    {
+      "epoch": 7.0625,
+      "grad_norm": 12.386910438537598,
+      "learning_rate": 8.18039110138882e-06,
+      "loss": 0.2171,
+      "num_input_tokens_seen": 177120,
+      "step": 565
+    },
+    {
+      "epoch": 7.125,
+      "grad_norm": 4.289523124694824,
+      "learning_rate": 8.138117126246951e-06,
+      "loss": 0.2133,
+      "num_input_tokens_seen": 178720,
+      "step": 570
+    },
+    {
+      "epoch": 7.1875,
+      "grad_norm": 3.2546839714050293,
+      "learning_rate": 8.095469746549172e-06,
+      "loss": 0.2257,
+      "num_input_tokens_seen": 180288,
+      "step": 575
+    },
+    {
+      "epoch": 7.25,
+      "grad_norm": 3.431429862976074,
+      "learning_rate": 8.052454036907174e-06,
+      "loss": 0.1958,
+      "num_input_tokens_seen": 181888,
+      "step": 580
+    },
+    {
+      "epoch": 7.3125,
+      "grad_norm": 2.91300106048584,
+      "learning_rate": 8.009075115760243e-06,
+      "loss": 0.2011,
+      "num_input_tokens_seen": 183456,
+      "step": 585
+    },
+    {
+      "epoch": 7.375,
+      "grad_norm": 3.276212453842163,
+      "learning_rate": 7.965338144766186e-06,
+      "loss": 0.2053,
+      "num_input_tokens_seen": 185024,
+      "step": 590
+    },
+    {
+      "epoch": 7.4375,
+      "grad_norm": 7.24328088760376,
+      "learning_rate": 7.921248328187174e-06,
+      "loss": 0.2549,
+      "num_input_tokens_seen": 186592,
+      "step": 595
+    },
+    {
+      "epoch": 7.5,
+      "grad_norm": 6.893558979034424,
+      "learning_rate": 7.876810912270462e-06,
+      "loss": 0.237,
+      "num_input_tokens_seen": 188128,
+      "step": 600
+    },
+    {
+      "epoch": 7.5625,
+      "grad_norm": 5.550233364105225,
+      "learning_rate": 7.832031184624165e-06,
+      "loss": 0.2381,
+      "num_input_tokens_seen": 189664,
+      "step": 605
+    },
+    {
+      "epoch": 7.625,
+      "grad_norm": 3.1067495346069336,
+      "learning_rate": 7.786914473588057e-06,
+      "loss": 0.1874,
+      "num_input_tokens_seen": 191232,
+      "step": 610
+    },
+    {
+      "epoch": 7.6875,
+      "grad_norm": 7.392305850982666,
+      "learning_rate": 7.74146614759957e-06,
+      "loss": 0.2148,
+      "num_input_tokens_seen": 192800,
+      "step": 615
+    },
+    {
+      "epoch": 7.75,
+      "grad_norm": 6.367889404296875,
+      "learning_rate": 7.695691614555002e-06,
+      "loss": 0.1938,
+      "num_input_tokens_seen": 194368,
+      "step": 620
+    },
+    {
+      "epoch": 7.8125,
+      "grad_norm": 3.6397085189819336,
+      "learning_rate": 7.649596321166024e-06,
+      "loss": 0.2583,
+      "num_input_tokens_seen": 195936,
+      "step": 625
+    },
+    {
+      "epoch": 7.875,
+      "grad_norm": 4.538295269012451,
+      "learning_rate": 7.603185752311587e-06,
+      "loss": 0.2361,
+      "num_input_tokens_seen": 197568,
+      "step": 630
+    },
+    {
+      "epoch": 7.9375,
+      "grad_norm": 2.7171273231506348,
+      "learning_rate": 7.55646543038526e-06,
+      "loss": 0.2024,
+      "num_input_tokens_seen": 199136,
+      "step": 635
+    },
+    {
+      "epoch": 8.0,
+      "grad_norm": 3.4454898834228516,
+      "learning_rate": 7.50944091463814e-0
1059
+ "loss": 0.203,
1060
+ "num_input_tokens_seen": 200704,
1061
+ "step": 640
1062
+ },
1063
+ {
1064
+ "epoch": 8.0,
1065
+ "eval_loss": 0.2646110951900482,
1066
+ "eval_runtime": 0.9235,
1067
+ "eval_samples_per_second": 86.628,
1068
+ "eval_steps_per_second": 21.657,
1069
+ "num_input_tokens_seen": 200704,
1070
+ "step": 640
1071
+ },
1072
+ {
+ "epoch": 8.0625,
+ "grad_norm": 9.628674507141113,
+ "learning_rate": 7.462117800517337e-06,
+ "loss": 0.2106,
+ "num_input_tokens_seen": 202272,
+ "step": 645
+ },
+ {
+ "epoch": 8.125,
+ "grad_norm": 4.698387622833252,
+ "learning_rate": 7.414501719000187e-06,
+ "loss": 0.1585,
+ "num_input_tokens_seen": 203872,
+ "step": 650
+ },
+ {
+ "epoch": 8.1875,
+ "grad_norm": 14.213072776794434,
+ "learning_rate": 7.3665983359242175e-06,
+ "loss": 0.1878,
+ "num_input_tokens_seen": 205408,
+ "step": 655
+ },
+ {
+ "epoch": 8.25,
+ "grad_norm": 18.2841854095459,
+ "learning_rate": 7.318413351312965e-06,
+ "loss": 0.2258,
+ "num_input_tokens_seen": 207040,
+ "step": 660
+ },
+ {
+ "epoch": 8.3125,
+ "grad_norm": 12.26921558380127,
+ "learning_rate": 7.269952498697734e-06,
+ "loss": 0.213,
+ "num_input_tokens_seen": 208608,
+ "step": 665
+ },
+ {
+ "epoch": 8.375,
+ "grad_norm": 14.23438549041748,
+ "learning_rate": 7.221221544435364e-06,
+ "loss": 0.2241,
+ "num_input_tokens_seen": 210240,
+ "step": 670
+ },
+ {
+ "epoch": 8.4375,
+ "grad_norm": 6.5942487716674805,
+ "learning_rate": 7.172226287022086e-06,
+ "loss": 0.2082,
+ "num_input_tokens_seen": 211744,
+ "step": 675
+ },
+ {
+ "epoch": 8.5,
+ "grad_norm": 6.377285957336426,
+ "learning_rate": 7.1229725564035665e-06,
+ "loss": 0.1792,
+ "num_input_tokens_seen": 213280,
+ "step": 680
+ },
+ {
+ "epoch": 8.5625,
+ "grad_norm": 5.1635966300964355,
+ "learning_rate": 7.073466213281196e-06,
+ "loss": 0.1922,
+ "num_input_tokens_seen": 214816,
+ "step": 685
+ },
+ {
+ "epoch": 8.625,
+ "grad_norm": 12.18478012084961,
+ "learning_rate": 7.023713148414728e-06,
+ "loss": 0.2235,
+ "num_input_tokens_seen": 216416,
+ "step": 690
+ },
+ {
+ "epoch": 8.6875,
+ "grad_norm": 5.056797504425049,
+ "learning_rate": 6.973719281921336e-06,
+ "loss": 0.2177,
+ "num_input_tokens_seen": 217952,
+ "step": 695
+ },
+ {
+ "epoch": 8.75,
+ "grad_norm": 6.05808162689209,
+ "learning_rate": 6.9234905625711816e-06,
+ "loss": 0.2071,
+ "num_input_tokens_seen": 219552,
+ "step": 700
+ },
+ {
+ "epoch": 8.8125,
+ "grad_norm": 4.5069499015808105,
+ "learning_rate": 6.873032967079562e-06,
+ "loss": 0.177,
+ "num_input_tokens_seen": 221120,
+ "step": 705
+ },
+ {
+ "epoch": 8.875,
+ "grad_norm": 9.237682342529297,
+ "learning_rate": 6.822352499395751e-06,
+ "loss": 0.188,
+ "num_input_tokens_seen": 222656,
+ "step": 710
+ },
+ {
+ "epoch": 8.9375,
+ "grad_norm": 9.70540714263916,
+ "learning_rate": 6.771455189988579e-06,
+ "loss": 0.1792,
+ "num_input_tokens_seen": 224160,
+ "step": 715
+ },
+ {
+ "epoch": 9.0,
+ "grad_norm": 29.520042419433594,
+ "learning_rate": 6.720347095128884e-06,
+ "loss": 0.248,
+ "num_input_tokens_seen": 225728,
+ "step": 720
+ },
+ {
+ "epoch": 9.0625,
+ "grad_norm": 6.013643741607666,
+ "learning_rate": 6.669034296168855e-06,
+ "loss": 0.1972,
+ "num_input_tokens_seen": 227296,
+ "step": 725
+ },
+ {
+ "epoch": 9.125,
+ "grad_norm": 8.753713607788086,
+ "learning_rate": 6.617522898818426e-06,
+ "loss": 0.1308,
+ "num_input_tokens_seen": 228896,
+ "step": 730
+ },
+ {
+ "epoch": 9.1875,
+ "grad_norm": 32.677330017089844,
+ "learning_rate": 6.565819032418748e-06,
+ "loss": 0.2039,
+ "num_input_tokens_seen": 230464,
+ "step": 735
+ },
+ {
+ "epoch": 9.25,
+ "grad_norm": 12.465975761413574,
+ "learning_rate": 6.513928849212874e-06,
+ "loss": 0.1759,
+ "num_input_tokens_seen": 232032,
+ "step": 740
+ },
+ {
+ "epoch": 9.3125,
+ "grad_norm": 25.805383682250977,
+ "learning_rate": 6.461858523613684e-06,
+ "loss": 0.1213,
+ "num_input_tokens_seen": 233632,
+ "step": 745
+ },
+ {
+ "epoch": 9.375,
+ "grad_norm": 10.997842788696289,
+ "learning_rate": 6.4096142514692085e-06,
+ "loss": 0.1275,
+ "num_input_tokens_seen": 235200,
+ "step": 750
+ },
+ {
+ "epoch": 9.4375,
+ "grad_norm": 16.847126007080078,
+ "learning_rate": 6.3572022493253715e-06,
+ "loss": 0.2011,
+ "num_input_tokens_seen": 236704,
+ "step": 755
+ },
+ {
+ "epoch": 9.5,
+ "grad_norm": 15.590519905090332,
+ "learning_rate": 6.304628753686295e-06,
+ "loss": 0.1765,
+ "num_input_tokens_seen": 238240,
+ "step": 760
+ },
+ {
+ "epoch": 9.5625,
+ "grad_norm": 8.250570297241211,
+ "learning_rate": 6.251900020272208e-06,
+ "loss": 0.1771,
+ "num_input_tokens_seen": 239776,
+ "step": 765
+ },
+ {
+ "epoch": 9.625,
+ "grad_norm": 11.40953254699707,
+ "learning_rate": 6.199022323275083e-06,
+ "loss": 0.1629,
+ "num_input_tokens_seen": 241280,
+ "step": 770
+ },
+ {
+ "epoch": 9.6875,
+ "grad_norm": 10.189095497131348,
+ "learning_rate": 6.146001954612072e-06,
+ "loss": 0.1316,
+ "num_input_tokens_seen": 242752,
+ "step": 775
+ },
+ {
+ "epoch": 9.75,
+ "grad_norm": 15.832845687866211,
+ "learning_rate": 6.092845223176823e-06,
+ "loss": 0.1367,
+ "num_input_tokens_seen": 244352,
+ "step": 780
+ },
+ {
+ "epoch": 9.8125,
+ "grad_norm": 14.90906810760498,
+ "learning_rate": 6.039558454088796e-06,
+ "loss": 0.1887,
+ "num_input_tokens_seen": 245984,
+ "step": 785
+ },
+ {
+ "epoch": 9.875,
+ "grad_norm": 10.249608039855957,
+ "learning_rate": 5.986147987940632e-06,
+ "loss": 0.1589,
+ "num_input_tokens_seen": 247552,
+ "step": 790
+ },
+ {
+ "epoch": 9.9375,
+ "grad_norm": 5.563095569610596,
+ "learning_rate": 5.932620180043674e-06,
+ "loss": 0.1707,
+ "num_input_tokens_seen": 249088,
+ "step": 795
+ },
+ {
+ "epoch": 10.0,
+ "grad_norm": 9.259960174560547,
+ "learning_rate": 5.878981399671774e-06,
+ "loss": 0.2158,
+ "num_input_tokens_seen": 250592,
+ "step": 800
+ },
+ {
+ "epoch": 10.0,
+ "eval_loss": 0.3806591033935547,
+ "eval_runtime": 0.9182,
+ "eval_samples_per_second": 87.124,
+ "eval_steps_per_second": 21.781,
+ "num_input_tokens_seen": 250592,
+ "step": 800
+ },
1337
+ {
+ "epoch": 10.0625,
+ "grad_norm": 10.863539695739746,
+ "learning_rate": 5.825238029303388e-06,
+ "loss": 0.0837,
+ "num_input_tokens_seen": 252256,
+ "step": 805
+ },
+ {
+ "epoch": 10.125,
+ "grad_norm": 8.517024993896484,
+ "learning_rate": 5.771396463862145e-06,
+ "loss": 0.0796,
+ "num_input_tokens_seen": 253824,
+ "step": 810
+ },
+ {
+ "epoch": 10.1875,
+ "grad_norm": 9.531097412109375,
+ "learning_rate": 5.717463109955896e-06,
+ "loss": 0.0816,
+ "num_input_tokens_seen": 255360,
+ "step": 815
+ },
+ {
+ "epoch": 10.25,
+ "grad_norm": 20.906747817993164,
+ "learning_rate": 5.6634443851144115e-06,
+ "loss": 0.0815,
+ "num_input_tokens_seen": 256960,
+ "step": 820
+ },
+ {
+ "epoch": 10.3125,
+ "grad_norm": 21.023405075073242,
+ "learning_rate": 5.609346717025738e-06,
+ "loss": 0.1234,
+ "num_input_tokens_seen": 258528,
+ "step": 825
+ },
+ {
+ "epoch": 10.375,
+ "grad_norm": 34.56439208984375,
+ "learning_rate": 5.555176542771389e-06,
+ "loss": 0.1462,
+ "num_input_tokens_seen": 260096,
+ "step": 830
+ },
+ {
+ "epoch": 10.4375,
+ "grad_norm": 16.453866958618164,
+ "learning_rate": 5.500940308060382e-06,
+ "loss": 0.1031,
+ "num_input_tokens_seen": 261632,
+ "step": 835
+ },
+ {
+ "epoch": 10.5,
+ "grad_norm": 20.025487899780273,
+ "learning_rate": 5.446644466462269e-06,
+ "loss": 0.1215,
+ "num_input_tokens_seen": 263232,
+ "step": 840
+ },
+ {
+ "epoch": 10.5625,
+ "grad_norm": 21.75179672241211,
+ "learning_rate": 5.392295478639226e-06,
+ "loss": 0.2008,
+ "num_input_tokens_seen": 264832,
+ "step": 845
+ },
+ {
+ "epoch": 10.625,
+ "grad_norm": 26.04926300048828,
+ "learning_rate": 5.337899811577297e-06,
+ "loss": 0.1439,
+ "num_input_tokens_seen": 266432,
+ "step": 850
+ },
+ {
+ "epoch": 10.6875,
+ "grad_norm": 6.14185094833374,
+ "learning_rate": 5.283463937816888e-06,
+ "loss": 0.0927,
+ "num_input_tokens_seen": 268000,
+ "step": 855
+ },
+ {
+ "epoch": 10.75,
+ "grad_norm": 7.8607635498046875,
+ "learning_rate": 5.228994334682605e-06,
+ "loss": 0.1166,
+ "num_input_tokens_seen": 269568,
+ "step": 860
+ },
+ {
+ "epoch": 10.8125,
+ "grad_norm": 14.155070304870605,
+ "learning_rate": 5.174497483512506e-06,
+ "loss": 0.1066,
+ "num_input_tokens_seen": 271104,
+ "step": 865
+ },
+ {
+ "epoch": 10.875,
+ "grad_norm": 14.761529922485352,
+ "learning_rate": 5.1199798688868955e-06,
+ "loss": 0.1839,
+ "num_input_tokens_seen": 272640,
+ "step": 870
+ },
+ {
+ "epoch": 10.9375,
+ "grad_norm": 86.08441162109375,
+ "learning_rate": 5.065447977856723e-06,
+ "loss": 0.2267,
+ "num_input_tokens_seen": 274208,
+ "step": 875
+ },
+ {
+ "epoch": 11.0,
+ "grad_norm": 6.295602798461914,
+ "learning_rate": 5.010908299171685e-06,
+ "loss": 0.195,
+ "num_input_tokens_seen": 275776,
+ "step": 880
+ },
+ {
+ "epoch": 11.0625,
+ "grad_norm": 29.30211067199707,
+ "learning_rate": 4.956367322508131e-06,
+ "loss": 0.0878,
+ "num_input_tokens_seen": 277344,
+ "step": 885
+ },
+ {
+ "epoch": 11.125,
+ "grad_norm": 36.51382827758789,
+ "learning_rate": 4.90183153769686e-06,
+ "loss": 0.097,
+ "num_input_tokens_seen": 278944,
+ "step": 890
+ },
+ {
+ "epoch": 11.1875,
+ "grad_norm": 24.661712646484375,
+ "learning_rate": 4.847307433950888e-06,
+ "loss": 0.1409,
+ "num_input_tokens_seen": 280480,
+ "step": 895
+ },
+ {
+ "epoch": 11.25,
+ "grad_norm": 15.945549011230469,
+ "learning_rate": 4.792801499093305e-06,
+ "loss": 0.0621,
+ "num_input_tokens_seen": 282048,
+ "step": 900
+ },
+ {
+ "epoch": 11.3125,
+ "grad_norm": 6.951054573059082,
+ "learning_rate": 4.738320218785281e-06,
+ "loss": 0.1243,
+ "num_input_tokens_seen": 283584,
+ "step": 905
+ },
+ {
+ "epoch": 11.375,
+ "grad_norm": 10.031401634216309,
+ "learning_rate": 4.683870075754347e-06,
+ "loss": 0.0546,
+ "num_input_tokens_seen": 285216,
+ "step": 910
+ },
+ {
+ "epoch": 11.4375,
+ "grad_norm": 10.544556617736816,
+ "learning_rate": 4.629457549023004e-06,
+ "loss": 0.0756,
+ "num_input_tokens_seen": 286784,
+ "step": 915
+ },
+ {
+ "epoch": 11.5,
+ "grad_norm": 13.169859886169434,
+ "learning_rate": 4.575089113137792e-06,
+ "loss": 0.0427,
+ "num_input_tokens_seen": 288352,
+ "step": 920
+ },
+ {
+ "epoch": 11.5625,
+ "grad_norm": 9.484841346740723,
+ "learning_rate": 4.52077123739888e-06,
+ "loss": 0.1803,
+ "num_input_tokens_seen": 289920,
+ "step": 925
+ },
+ {
+ "epoch": 11.625,
+ "grad_norm": 32.46955871582031,
+ "learning_rate": 4.466510385090287e-06,
+ "loss": 0.1306,
+ "num_input_tokens_seen": 291520,
+ "step": 930
+ },
+ {
+ "epoch": 11.6875,
+ "grad_norm": 13.81043529510498,
+ "learning_rate": 4.4123130127108125e-06,
+ "loss": 0.0687,
+ "num_input_tokens_seen": 293056,
+ "step": 935
+ },
+ {
+ "epoch": 11.75,
+ "grad_norm": 52.68048858642578,
+ "learning_rate": 4.358185569205779e-06,
+ "loss": 0.1427,
+ "num_input_tokens_seen": 294624,
+ "step": 940
+ },
+ {
+ "epoch": 11.8125,
+ "grad_norm": 14.252890586853027,
+ "learning_rate": 4.304134495199675e-06,
+ "loss": 0.052,
+ "num_input_tokens_seen": 296160,
+ "step": 945
+ },
+ {
+ "epoch": 11.875,
+ "grad_norm": 28.040077209472656,
+ "learning_rate": 4.250166222229775e-06,
+ "loss": 0.1382,
+ "num_input_tokens_seen": 297696,
+ "step": 950
+ },
+ {
+ "epoch": 11.9375,
+ "grad_norm": 10.452444076538086,
+ "learning_rate": 4.196287171980869e-06,
+ "loss": 0.0832,
+ "num_input_tokens_seen": 299296,
+ "step": 955
+ },
+ {
+ "epoch": 12.0,
+ "grad_norm": 0.764223575592041,
+ "learning_rate": 4.142503755521129e-06,
+ "loss": 0.013,
+ "num_input_tokens_seen": 300832,
+ "step": 960
+ },
+ {
+ "epoch": 12.0,
+ "eval_loss": 0.5269124507904053,
+ "eval_runtime": 0.9175,
+ "eval_samples_per_second": 87.197,
+ "eval_steps_per_second": 21.799,
+ "num_input_tokens_seen": 300832,
+ "step": 960
+ },
1602
+ {
+ "epoch": 12.0625,
+ "grad_norm": 31.75777816772461,
+ "learning_rate": 4.088822372539263e-06,
+ "loss": 0.0538,
+ "num_input_tokens_seen": 302368,
+ "step": 965
+ },
+ {
+ "epoch": 12.125,
+ "grad_norm": 1.6154866218566895,
+ "learning_rate": 4.0352494105830155e-06,
+ "loss": 0.0165,
+ "num_input_tokens_seen": 303936,
+ "step": 970
+ },
+ {
+ "epoch": 12.1875,
+ "grad_norm": 3.850043296813965,
+ "learning_rate": 3.981791244299113e-06,
+ "loss": 0.0048,
+ "num_input_tokens_seen": 305536,
+ "step": 975
+ },
+ {
+ "epoch": 12.25,
+ "grad_norm": 48.39171600341797,
+ "learning_rate": 3.928454234674748e-06,
+ "loss": 0.0607,
+ "num_input_tokens_seen": 307136,
+ "step": 980
+ },
+ {
+ "epoch": 12.3125,
+ "grad_norm": 2.30706787109375,
+ "learning_rate": 3.875244728280676e-06,
+ "loss": 0.0606,
+ "num_input_tokens_seen": 308672,
+ "step": 985
+ },
+ {
+ "epoch": 12.375,
+ "grad_norm": 39.70808029174805,
+ "learning_rate": 3.822169056516051e-06,
+ "loss": 0.0663,
+ "num_input_tokens_seen": 310272,
+ "step": 990
+ },
+ {
+ "epoch": 12.4375,
+ "grad_norm": 28.03505516052246,
+ "learning_rate": 3.769233534855035e-06,
+ "loss": 0.0959,
+ "num_input_tokens_seen": 311840,
+ "step": 995
+ },
+ {
+ "epoch": 12.5,
+ "grad_norm": 47.5093879699707,
+ "learning_rate": 3.7164444620953397e-06,
+ "loss": 0.0328,
+ "num_input_tokens_seen": 313376,
+ "step": 1000
+ },
+ {
+ "epoch": 12.5625,
+ "grad_norm": 73.91315460205078,
+ "learning_rate": 3.663808119608716e-06,
+ "loss": 0.1293,
+ "num_input_tokens_seen": 314976,
+ "step": 1005
+ },
+ {
+ "epoch": 12.625,
+ "grad_norm": 60.9661865234375,
+ "learning_rate": 3.6113307705935398e-06,
+ "loss": 0.1331,
+ "num_input_tokens_seen": 316608,
+ "step": 1010
+ },
+ {
+ "epoch": 12.6875,
+ "grad_norm": 34.23344421386719,
+ "learning_rate": 3.559018659329554e-06,
+ "loss": 0.0214,
+ "num_input_tokens_seen": 318176,
+ "step": 1015
+ },
+ {
+ "epoch": 12.75,
+ "grad_norm": 68.91053771972656,
+ "learning_rate": 3.5068780104348632e-06,
+ "loss": 0.2041,
+ "num_input_tokens_seen": 319744,
+ "step": 1020
+ },
+ {
+ "epoch": 12.8125,
+ "grad_norm": 12.771360397338867,
+ "learning_rate": 3.4549150281252635e-06,
+ "loss": 0.0461,
+ "num_input_tokens_seen": 321312,
+ "step": 1025
+ },
+ {
+ "epoch": 12.875,
+ "grad_norm": 11.279967308044434,
+ "learning_rate": 3.403135895476004e-06,
+ "loss": 0.0526,
+ "num_input_tokens_seen": 322816,
+ "step": 1030
+ },
+ {
+ "epoch": 12.9375,
+ "grad_norm": 36.483665466308594,
+ "learning_rate": 3.351546773686065e-06,
+ "loss": 0.0837,
+ "num_input_tokens_seen": 324352,
+ "step": 1035
+ },
+ {
+ "epoch": 13.0,
+ "grad_norm": 5.231341361999512,
+ "learning_rate": 3.3001538013450285e-06,
+ "loss": 0.0695,
+ "num_input_tokens_seen": 325888,
+ "step": 1040
+ },
+ {
+ "epoch": 13.0625,
+ "grad_norm": 0.3779163956642151,
+ "learning_rate": 3.248963093702663e-06,
+ "loss": 0.0415,
+ "num_input_tokens_seen": 327456,
+ "step": 1045
+ },
+ {
+ "epoch": 13.125,
+ "grad_norm": 2.168973207473755,
+ "learning_rate": 3.1979807419412523e-06,
+ "loss": 0.013,
+ "num_input_tokens_seen": 329056,
+ "step": 1050
+ },
+ {
+ "epoch": 13.1875,
+ "grad_norm": 9.570158958435059,
+ "learning_rate": 3.147212812450819e-06,
+ "loss": 0.0101,
+ "num_input_tokens_seen": 330624,
+ "step": 1055
+ },
+ {
+ "epoch": 13.25,
+ "grad_norm": 13.025811195373535,
+ "learning_rate": 3.0966653461072778e-06,
+ "loss": 0.005,
+ "num_input_tokens_seen": 332192,
+ "step": 1060
+ },
+ {
+ "epoch": 13.3125,
+ "grad_norm": 0.06026485562324524,
+ "learning_rate": 3.0463443575536324e-06,
+ "loss": 0.0222,
+ "num_input_tokens_seen": 333760,
+ "step": 1065
+ },
+ {
+ "epoch": 13.375,
+ "grad_norm": 2.276689052581787,
+ "learning_rate": 2.9962558344842963e-06,
+ "loss": 0.05,
+ "num_input_tokens_seen": 335328,
+ "step": 1070
+ },
+ {
+ "epoch": 13.4375,
+ "grad_norm": 0.11900363117456436,
+ "learning_rate": 2.946405736932615e-06,
+ "loss": 0.0244,
+ "num_input_tokens_seen": 336864,
+ "step": 1075
+ },
+ {
+ "epoch": 13.5,
+ "grad_norm": 2.8280792236328125,
+ "learning_rate": 2.8967999965616815e-06,
+ "loss": 0.0222,
+ "num_input_tokens_seen": 338400,
+ "step": 1080
+ },
+ {
+ "epoch": 13.5625,
+ "grad_norm": 15.377069473266602,
+ "learning_rate": 2.8474445159585235e-06,
+ "loss": 0.0272,
+ "num_input_tokens_seen": 339968,
+ "step": 1085
+ },
+ {
+ "epoch": 13.625,
+ "grad_norm": 7.599987983703613,
+ "learning_rate": 2.798345167931771e-06,
+ "loss": 0.0055,
+ "num_input_tokens_seen": 341568,
+ "step": 1090
+ },
+ {
+ "epoch": 13.6875,
+ "grad_norm": 2.0365688800811768,
+ "learning_rate": 2.7495077948128245e-06,
+ "loss": 0.0058,
+ "num_input_tokens_seen": 343168,
+ "step": 1095
+ },
+ {
+ "epoch": 13.75,
+ "grad_norm": 40.94605255126953,
+ "learning_rate": 2.700938207760701e-06,
+ "loss": 0.0186,
+ "num_input_tokens_seen": 344704,
+ "step": 1100
+ },
+ {
+ "epoch": 13.8125,
+ "grad_norm": 0.5802567601203918,
+ "learning_rate": 2.6526421860705474e-06,
+ "loss": 0.002,
+ "num_input_tokens_seen": 346272,
+ "step": 1105
+ },
+ {
+ "epoch": 13.875,
+ "grad_norm": 2.4557807445526123,
+ "learning_rate": 2.6046254764859687e-06,
+ "loss": 0.0125,
+ "num_input_tokens_seen": 347808,
+ "step": 1110
+ },
+ {
+ "epoch": 13.9375,
+ "grad_norm": 103.32002258300781,
+ "learning_rate": 2.5568937925152272e-06,
+ "loss": 0.05,
+ "num_input_tokens_seen": 349376,
+ "step": 1115
+ },
+ {
+ "epoch": 14.0,
+ "grad_norm": 0.0683021992444992,
+ "learning_rate": 2.5094528137513797e-06,
+ "loss": 0.0174,
+ "num_input_tokens_seen": 350976,
+ "step": 1120
+ },
+ {
+ "epoch": 14.0,
+ "eval_loss": 0.8447664380073547,
+ "eval_runtime": 0.9205,
+ "eval_samples_per_second": 86.907,
+ "eval_steps_per_second": 21.727,
+ "num_input_tokens_seen": 350976,
+ "step": 1120
+ },
1867
+ {
+ "epoch": 14.0625,
+ "grad_norm": 0.2838919758796692,
+ "learning_rate": 2.462308185196481e-06,
+ "loss": 0.0008,
+ "num_input_tokens_seen": 352512,
+ "step": 1125
+ },
+ {
+ "epoch": 14.125,
+ "grad_norm": 0.25671952962875366,
+ "learning_rate": 2.4154655165898626e-06,
+ "loss": 0.0009,
+ "num_input_tokens_seen": 354048,
+ "step": 1130
+ },
+ {
+ "epoch": 14.1875,
+ "grad_norm": 0.08236575126647949,
+ "learning_rate": 2.3689303817406523e-06,
+ "loss": 0.0019,
+ "num_input_tokens_seen": 355584,
+ "step": 1135
+ },
+ {
+ "epoch": 14.25,
+ "grad_norm": 0.2644610106945038,
+ "learning_rate": 2.3227083178645316e-06,
+ "loss": 0.0053,
+ "num_input_tokens_seen": 357152,
+ "step": 1140
+ },
+ {
+ "epoch": 14.3125,
+ "grad_norm": 0.06165740266442299,
+ "learning_rate": 2.2768048249248648e-06,
+ "loss": 0.0109,
+ "num_input_tokens_seen": 358752,
+ "step": 1145
+ },
+ {
+ "epoch": 14.375,
+ "grad_norm": 0.31678006052970886,
+ "learning_rate": 2.2312253649782655e-06,
+ "loss": 0.0006,
+ "num_input_tokens_seen": 360288,
+ "step": 1150
+ },
+ {
+ "epoch": 14.4375,
+ "grad_norm": 0.9465001225471497,
+ "learning_rate": 2.185975361524657e-06,
+ "loss": 0.0722,
+ "num_input_tokens_seen": 361856,
+ "step": 1155
+ },
+ {
+ "epoch": 14.5,
+ "grad_norm": 2.73948335647583,
+ "learning_rate": 2.1410601988619394e-06,
+ "loss": 0.0038,
+ "num_input_tokens_seen": 363392,
+ "step": 1160
+ },
+ {
+ "epoch": 14.5625,
+ "grad_norm": 0.18431881070137024,
+ "learning_rate": 2.096485221445301e-06,
+ "loss": 0.0063,
+ "num_input_tokens_seen": 364960,
+ "step": 1165
+ },
+ {
+ "epoch": 14.625,
+ "grad_norm": 6.978606700897217,
+ "learning_rate": 2.0522557332512953e-06,
+ "loss": 0.0038,
+ "num_input_tokens_seen": 366528,
+ "step": 1170
+ },
+ {
+ "epoch": 14.6875,
+ "grad_norm": 0.9169700145721436,
+ "learning_rate": 2.008376997146705e-06,
+ "loss": 0.0268,
+ "num_input_tokens_seen": 368032,
+ "step": 1175
+ },
+ {
+ "epoch": 14.75,
+ "grad_norm": 0.3377976417541504,
+ "learning_rate": 1.9648542342623276e-06,
+ "loss": 0.0022,
+ "num_input_tokens_seen": 369632,
+ "step": 1180
+ },
+ {
+ "epoch": 14.8125,
+ "grad_norm": 0.23450040817260742,
+ "learning_rate": 1.9216926233717087e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 371232,
+ "step": 1185
+ },
+ {
+ "epoch": 14.875,
+ "grad_norm": 17.118005752563477,
+ "learning_rate": 1.8788973002749112e-06,
+ "loss": 0.0058,
+ "num_input_tokens_seen": 372800,
+ "step": 1190
+ },
+ {
+ "epoch": 14.9375,
+ "grad_norm": 23.535619735717773,
+ "learning_rate": 1.83647335718742e-06,
+ "loss": 0.008,
+ "num_input_tokens_seen": 374368,
+ "step": 1195
+ },
+ {
+ "epoch": 15.0,
+ "grad_norm": 0.2620501220226288,
+ "learning_rate": 1.7944258421342097e-06,
+ "loss": 0.0006,
+ "num_input_tokens_seen": 376000,
+ "step": 1200
+ },
+ {
+ "epoch": 15.0625,
+ "grad_norm": 0.06305874139070511,
+ "learning_rate": 1.7527597583490825e-06,
+ "loss": 0.0076,
+ "num_input_tokens_seen": 377568,
+ "step": 1205
+ },
+ {
+ "epoch": 15.125,
+ "grad_norm": 0.17243485152721405,
+ "learning_rate": 1.7114800636793378e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 379072,
+ "step": 1210
+ },
+ {
+ "epoch": 15.1875,
+ "grad_norm": 109.47997283935547,
+ "learning_rate": 1.6705916699958292e-06,
+ "loss": 0.0422,
+ "num_input_tokens_seen": 380640,
+ "step": 1215
+ },
+ {
+ "epoch": 15.25,
+ "grad_norm": 13.874452590942383,
+ "learning_rate": 1.6300994426085103e-06,
+ "loss": 0.0015,
+ "num_input_tokens_seen": 382240,
+ "step": 1220
+ },
+ {
+ "epoch": 15.3125,
+ "grad_norm": 0.06081826239824295,
+ "learning_rate": 1.5900081996875083e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 383872,
+ "step": 1225
+ },
+ {
+ "epoch": 15.375,
+ "grad_norm": 0.1609436422586441,
+ "learning_rate": 1.5503227116898017e-06,
+ "loss": 0.0004,
+ "num_input_tokens_seen": 385472,
+ "step": 1230
+ },
+ {
+ "epoch": 15.4375,
+ "grad_norm": 4.266076564788818,
+ "learning_rate": 1.5110477007916002e-06,
+ "loss": 0.0012,
+ "num_input_tokens_seen": 387072,
+ "step": 1235
+ },
+ {
+ "epoch": 15.5,
+ "grad_norm": 0.06309761852025986,
+ "learning_rate": 1.4721878403264344e-06,
+ "loss": 0.0006,
+ "num_input_tokens_seen": 388640,
+ "step": 1240
+ },
+ {
+ "epoch": 15.5625,
+ "grad_norm": 0.2793193459510803,
+ "learning_rate": 1.433747754229093e-06,
+ "loss": 0.0009,
+ "num_input_tokens_seen": 390240,
+ "step": 1245
+ },
+ {
+ "epoch": 15.625,
+ "grad_norm": 0.028820207342505455,
+ "learning_rate": 1.395732016485406e-06,
+ "loss": 0.0006,
+ "num_input_tokens_seen": 391808,
+ "step": 1250
+ },
+ {
+ "epoch": 15.6875,
+ "grad_norm": 0.11917386204004288,
+ "learning_rate": 1.3581451505879995e-06,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 393408,
+ "step": 1255
+ },
+ {
+ "epoch": 15.75,
+ "grad_norm": 0.14404775202274323,
+ "learning_rate": 1.3209916289980336e-06,
+ "loss": 0.0004,
+ "num_input_tokens_seen": 394976,
+ "step": 1260
+ },
+ {
+ "epoch": 15.8125,
+ "grad_norm": 0.47015178203582764,
+ "learning_rate": 1.2842758726130283e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 396448,
+ "step": 1265
+ },
+ {
+ "epoch": 15.875,
+ "grad_norm": 37.57123565673828,
+ "learning_rate": 1.2480022502408306e-06,
+ "loss": 0.0103,
+ "num_input_tokens_seen": 398048,
+ "step": 1270
+ },
+ {
+ "epoch": 15.9375,
+ "grad_norm": 0.09548679739236832,
+ "learning_rate": 1.2121750780797514e-06,
+ "loss": 0.0006,
+ "num_input_tokens_seen": 399648,
+ "step": 1275
+ },
+ {
+ "epoch": 16.0,
+ "grad_norm": 0.09692618250846863,
+ "learning_rate": 1.1767986192049986e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 401184,
+ "step": 1280
+ },
+ {
+ "epoch": 16.0,
+ "eval_loss": 0.9965259432792664,
+ "eval_runtime": 0.9197,
+ "eval_samples_per_second": 86.981,
+ "eval_steps_per_second": 21.745,
+ "num_input_tokens_seen": 401184,
+ "step": 1280
+ },
2132
+ {
+ "epoch": 16.0625,
+ "grad_norm": 0.06563537567853928,
+ "learning_rate": 1.1418770830614012e-06,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 402816,
+ "step": 1285
+ },
+ {
+ "epoch": 16.125,
+ "grad_norm": 25.29011344909668,
+ "learning_rate": 1.1074146249625334e-06,
+ "loss": 0.0035,
+ "num_input_tokens_seen": 404384,
+ "step": 1290
+ },
+ {
+ "epoch": 16.1875,
+ "grad_norm": 0.15210725367069244,
+ "learning_rate": 1.0734153455962765e-06,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 405984,
+ "step": 1295
+ },
+ {
+ "epoch": 16.25,
+ "grad_norm": 1.038436770439148,
+ "learning_rate": 1.0398832905368693e-06,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 407456,
+ "step": 1300
+ },
+ {
+ "epoch": 16.3125,
+ "grad_norm": 0.04672938957810402,
+ "learning_rate": 1.006822449763537e-06,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 409056,
+ "step": 1305
+ },
+ {
+ "epoch": 16.375,
+ "grad_norm": 0.10642636567354202,
+ "learning_rate": 9.742367571857092e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 410624,
+ "step": 1310
+ },
+ {
+ "epoch": 16.4375,
+ "grad_norm": 0.012557949870824814,
+ "learning_rate": 9.421300901749386e-07,
+ "loss": 0.0001,
+ "num_input_tokens_seen": 412192,
+ "step": 1315
+ },
+ {
+ "epoch": 16.5,
+ "grad_norm": 0.02506287954747677,
+ "learning_rate": 9.105062691035233e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 413728,
+ "step": 1320
+ },
+ {
+ "epoch": 16.5625,
+ "grad_norm": 0.19741126894950867,
+ "learning_rate": 8.793690568899216e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 415264,
+ "step": 1325
+ },
+ {
+ "epoch": 16.625,
+ "grad_norm": 0.3180708587169647,
+ "learning_rate": 8.487221585510075e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 416864,
+ "step": 1330
+ },
+ {
+ "epoch": 16.6875,
+ "grad_norm": 0.2431306093931198,
+ "learning_rate": 8.185692207612023e-07,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 418464,
+ "step": 1335
+ },
+ {
+ "epoch": 16.75,
+ "grad_norm": 0.07357773184776306,
+ "learning_rate": 7.88913831418568e-07,
+ "loss": 0.0008,
+ "num_input_tokens_seen": 420032,
+ "step": 1340
+ },
+ {
+ "epoch": 16.8125,
+ "grad_norm": 0.5140334963798523,
+ "learning_rate": 7.597595192178702e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 421568,
+ "step": 1345
+ },
+ {
+ "epoch": 16.875,
+ "grad_norm": 0.10227775573730469,
+ "learning_rate": 7.311097532307121e-07,
+ "loss": 0.0003,
+ "num_input_tokens_seen": 423136,
+ "step": 1350
+ },
+ {
+ "epoch": 16.9375,
+ "grad_norm": 0.04345543682575226,
+ "learning_rate": 7.029679424927366e-07,
+ "loss": 0.0009,
+ "num_input_tokens_seen": 424640,
+ "step": 1355
+ },
+ {
+ "epoch": 17.0,
+ "grad_norm": 0.2489359974861145,
+ "learning_rate": 6.753374355979975e-07,
+ "loss": 0.0002,
+ "num_input_tokens_seen": 426208,
+ "step": 1360
+ },
+ {
+ "epoch": 17.0625,
+ "grad_norm": 0.05077870190143585,
2263
+ "learning_rate": 6.482215203005016e-07,
2264
+ "loss": 0.0002,
2265
+ "num_input_tokens_seen": 427776,
2266
+ "step": 1365
2267
+ },
2268
+ {
2269
+ "epoch": 17.125,
2270
+ "grad_norm": 0.051710885018110275,
2271
+ "learning_rate": 6.216234231230012e-07,
2272
+ "loss": 0.0002,
2273
+ "num_input_tokens_seen": 429312,
2274
+ "step": 1370
2275
+ },
2276
+ {
2277
+ "epoch": 17.1875,
2278
+ "grad_norm": 0.08844368159770966,
2279
+ "learning_rate": 5.955463089730723e-07,
2280
+ "loss": 0.0001,
2281
+ "num_input_tokens_seen": 430880,
2282
+ "step": 1375
2283
+ },
2284
+ {
2285
+ "epoch": 17.25,
2286
+ "grad_norm": 0.06657011061906815,
2287
+ "learning_rate": 5.699932807665198e-07,
2288
+ "loss": 0.0002,
2289
+ "num_input_tokens_seen": 432448,
2290
+ "step": 1380
2291
+ },
2292
+ {
2293
+ "epoch": 17.3125,
2294
+ "grad_norm": 0.018467910587787628,
2295
+ "learning_rate": 5.449673790581611e-07,
2296
+ "loss": 0.0002,
2297
+ "num_input_tokens_seen": 434016,
2298
+ "step": 1385
2299
+ },
2300
+ {
2301
+ "epoch": 17.375,
2302
+ "grad_norm": 0.0749938040971756,
2303
+ "learning_rate": 5.204715816800343e-07,
2304
+ "loss": 0.0017,
2305
+ "num_input_tokens_seen": 435520,
2306
+ "step": 1390
2307
+ },
2308
+ {
2309
+ "epoch": 17.4375,
2310
+ "grad_norm": 0.09172087907791138,
2311
+ "learning_rate": 4.965088033870608e-07,
2312
+ "loss": 0.0002,
2313
+ "num_input_tokens_seen": 437056,
2314
+ "step": 1395
2315
+ },
2316
+ {
2317
+ "epoch": 17.5,
2318
+ "grad_norm": 0.05389586091041565,
2319
+ "learning_rate": 4.730818955102234e-07,
2320
+ "loss": 0.0002,
2321
+ "num_input_tokens_seen": 438624,
2322
+ "step": 1400
2323
+ },
2324
+ {
2325
+ "epoch": 17.5625,
2326
+ "grad_norm": 0.03451506793498993,
2327
+ "learning_rate": 4.501936456172845e-07,
2328
+ "loss": 0.0002,
2329
+ "num_input_tokens_seen": 440192,
2330
+ "step": 1405
2331
+ },
2332
+ {
2333
+ "epoch": 17.625,
2334
+ "grad_norm": 0.05420379713177681,
2335
+ "learning_rate": 4.278467771810896e-07,
2336
+ "loss": 0.0002,
2337
+ "num_input_tokens_seen": 441760,
2338
+ "step": 1410
2339
+ },
2340
+ {
2341
+ "epoch": 17.6875,
2342
+ "grad_norm": 0.013087287545204163,
2343
+ "learning_rate": 4.0604394925550906e-07,
2344
+ "loss": 0.0001,
2345
+ "num_input_tokens_seen": 443424,
2346
+ "step": 1415
2347
+ },
2348
+ {
2349
+ "epoch": 17.75,
2350
+ "grad_norm": 0.0480063296854496,
2351
+ "learning_rate": 3.8478775615902965e-07,
2352
+ "loss": 0.0007,
2353
+ "num_input_tokens_seen": 444992,
2354
+ "step": 1420
2355
+ },
2356
+ {
2357
+ "epoch": 17.8125,
2358
+ "grad_norm": 0.038323502987623215,
2359
+ "learning_rate": 3.6408072716606346e-07,
2360
+ "loss": 0.0002,
2361
+ "num_input_tokens_seen": 446560,
2362
+ "step": 1425
2363
+ },
2364
+ {
2365
+ "epoch": 17.875,
2366
+ "grad_norm": 0.22902554273605347,
2367
+ "learning_rate": 3.439253262059822e-07,
2368
+ "loss": 0.0002,
2369
+ "num_input_tokens_seen": 448192,
2370
+ "step": 1430
2371
+ },
2372
+ {
2373
+ "epoch": 17.9375,
2374
+ "grad_norm": 0.026656942442059517,
2375
+ "learning_rate": 3.24323951569942e-07,
2376
+ "loss": 0.0002,
2377
+ "num_input_tokens_seen": 449792,
2378
+ "step": 1435
2379
+ },
2380
+ {
2381
+ "epoch": 18.0,
2382
+ "grad_norm": 0.025560805574059486,
2383
+ "learning_rate": 3.052789356255037e-07,
2384
+ "loss": 0.0002,
2385
+ "num_input_tokens_seen": 451328,
2386
+ "step": 1440
2387
+ },
2388
+ {
2389
+ "epoch": 18.0,
2390
+ "eval_loss": 1.080618143081665,
2391
+ "eval_runtime": 0.9186,
2392
+ "eval_samples_per_second": 87.091,
2393
+ "eval_steps_per_second": 21.773,
2394
+ "num_input_tokens_seen": 451328,
2395
+ "step": 1440
2396
+ },
2397
+ {
2398
+ "epoch": 18.0625,
2399
+ "grad_norm": 0.03586863726377487,
2400
+ "learning_rate": 2.867925445391079e-07,
2401
+ "loss": 0.0004,
2402
+ "num_input_tokens_seen": 452896,
2403
+ "step": 1445
2404
+ },
2405
+ {
2406
+ "epoch": 18.125,
2407
+ "grad_norm": 0.012726732529699802,
2408
+ "learning_rate": 2.688669780064268e-07,
2409
+ "loss": 0.0002,
2410
+ "num_input_tokens_seen": 454368,
2411
+ "step": 1450
2412
+ },
2413
+ {
2414
+ "epoch": 18.1875,
2415
+ "grad_norm": 0.16940614581108093,
2416
+ "learning_rate": 2.5150436899061494e-07,
2417
+ "loss": 0.0002,
2418
+ "num_input_tokens_seen": 455936,
2419
+ "step": 1455
2420
+ },
2421
+ {
2422
+ "epoch": 18.25,
2423
+ "grad_norm": 0.10751322656869888,
2424
+ "learning_rate": 2.3470678346851517e-07,
2425
+ "loss": 0.0002,
2426
+ "num_input_tokens_seen": 457536,
2427
+ "step": 1460
2428
+ },
2429
+ {
2430
+ "epoch": 18.3125,
2431
+ "grad_norm": 0.01832233928143978,
2432
+ "learning_rate": 2.1847622018482283e-07,
2433
+ "loss": 0.0001,
2434
+ "num_input_tokens_seen": 459072,
2435
+ "step": 1465
2436
+ },
2437
+ {
2438
+ "epoch": 18.375,
2439
+ "grad_norm": 0.1913994550704956,
2440
+ "learning_rate": 2.028146104142581e-07,
2441
+ "loss": 0.0001,
2442
+ "num_input_tokens_seen": 460640,
2443
+ "step": 1470
2444
+ },
2445
+ {
2446
+ "epoch": 18.4375,
2447
+ "grad_norm": 0.17769768834114075,
2448
+ "learning_rate": 1.8772381773176417e-07,
2449
+ "loss": 0.0002,
2450
+ "num_input_tokens_seen": 462208,
2451
+ "step": 1475
2452
+ },
2453
+ {
2454
+ "epoch": 18.5,
2455
+ "grad_norm": 0.1154564693570137,
2456
+ "learning_rate": 1.7320563779075595e-07,
2457
+ "loss": 0.0001,
2458
+ "num_input_tokens_seen": 463744,
2459
+ "step": 1480
2460
+ },
2461
+ {
2462
+ "epoch": 18.5625,
2463
+ "grad_norm": 0.03802482411265373,
2464
+ "learning_rate": 1.5926179810946185e-07,
2465
+ "loss": 0.0002,
2466
+ "num_input_tokens_seen": 465344,
2467
+ "step": 1485
2468
+ },
2469
+ {
2470
+ "epoch": 18.625,
2471
+ "grad_norm": 0.16934075951576233,
2472
+ "learning_rate": 1.4589395786535954e-07,
2473
+ "loss": 0.0001,
2474
+ "num_input_tokens_seen": 466880,
2475
+ "step": 1490
2476
+ },
2477
+ {
2478
+ "epoch": 18.6875,
2479
+ "grad_norm": 0.045956190675497055,
2480
+ "learning_rate": 1.331037076977576e-07,
2481
+ "loss": 0.0002,
2482
+ "num_input_tokens_seen": 468448,
2483
+ "step": 1495
2484
+ },
2485
+ {
2486
+ "epoch": 18.75,
2487
+ "grad_norm": 0.04867981746792793,
2488
+ "learning_rate": 1.2089256951851923e-07,
2489
+ "loss": 0.0002,
2490
+ "num_input_tokens_seen": 470048,
2491
+ "step": 1500
2492
+ },
2493
+ {
2494
+ "epoch": 18.8125,
2495
+ "grad_norm": 0.04806559160351753,
2496
+ "learning_rate": 1.0926199633097156e-07,
2497
+ "loss": 0.0001,
2498
+ "num_input_tokens_seen": 471584,
2499
+ "step": 1505
2500
+ },
2501
+ {
2502
+ "epoch": 18.875,
2503
+ "grad_norm": 0.04029763862490654,
2504
+ "learning_rate": 9.821337205701664e-08,
2505
+ "loss": 0.0002,
2506
+ "num_input_tokens_seen": 473152,
2507
+ "step": 1510
2508
+ },
2509
+ {
2510
+ "epoch": 18.9375,
2511
+ "grad_norm": 0.07741861045360565,
2512
+ "learning_rate": 8.77480113724516e-08,
2513
+ "loss": 0.0004,
2514
+ "num_input_tokens_seen": 474720,
2515
+ "step": 1515
2516
+ },
2517
+ {
2518
+ "epoch": 19.0,
2519
+ "grad_norm": 0.011278778314590454,
2520
+ "learning_rate": 7.786715955054202e-08,
2521
+ "loss": 0.0001,
2522
+ "num_input_tokens_seen": 476320,
2523
+ "step": 1520
2524
+ },
2525
+ {
2526
+ "epoch": 19.0625,
2527
+ "grad_norm": 0.34853649139404297,
2528
+ "learning_rate": 6.857199231384282e-08,
2529
+ "loss": 0.0002,
2530
+ "num_input_tokens_seen": 477824,
2531
+ "step": 1525
2532
+ },
2533
+ {
2534
+ "epoch": 19.125,
2535
+ "grad_norm": 0.017191864550113678,
2536
+ "learning_rate": 5.986361569430166e-08,
2537
+ "loss": 0.0002,
2538
+ "num_input_tokens_seen": 479360,
2539
+ "step": 1530
2540
+ },
2541
+ {
2542
+ "epoch": 19.1875,
2543
+ "grad_norm": 0.08832144737243652,
2544
+ "learning_rate": 5.174306590164879e-08,
2545
+ "loss": 0.0003,
2546
+ "num_input_tokens_seen": 480896,
2547
+ "step": 1535
2548
+ },
2549
+ {
2550
+ "epoch": 19.25,
2551
+ "grad_norm": 0.17200443148612976,
2552
+ "learning_rate": 4.42113092001023e-08,
2553
+ "loss": 0.0002,
2554
+ "num_input_tokens_seen": 482464,
2555
+ "step": 1540
2556
+ },
2557
+ {
2558
+ "epoch": 19.3125,
2559
+ "grad_norm": 0.1417255848646164,
2560
+ "learning_rate": 3.726924179339009e-08,
2561
+ "loss": 0.0001,
2562
+ "num_input_tokens_seen": 484032,
2563
+ "step": 1545
2564
+ },
2565
+ {
2566
+ "epoch": 19.375,
2567
+ "grad_norm": 0.040031641721725464,
2568
+ "learning_rate": 3.09176897181096e-08,
2569
+ "loss": 0.0001,
2570
+ "num_input_tokens_seen": 485632,
2571
+ "step": 1550
2572
+ },
2573
+ {
2574
+ "epoch": 19.4375,
2575
+ "grad_norm": 0.024730442091822624,
2576
+ "learning_rate": 2.515740874544148e-08,
2577
+ "loss": 0.0002,
2578
+ "num_input_tokens_seen": 487232,
2579
+ "step": 1555
2580
+ },
2581
+ {
2582
+ "epoch": 19.5,
2583
+ "grad_norm": 0.14058633148670197,
2584
+ "learning_rate": 1.9989084291216487e-08,
2585
+ "loss": 0.0002,
2586
+ "num_input_tokens_seen": 488832,
2587
+ "step": 1560
2588
+ },
2589
+ {
2590
+ "epoch": 19.5625,
2591
+ "grad_norm": 0.15317003428936005,
2592
+ "learning_rate": 1.541333133436018e-08,
2593
+ "loss": 0.0001,
2594
+ "num_input_tokens_seen": 490400,
2595
+ "step": 1565
2596
+ },
2597
+ {
2598
+ "epoch": 19.625,
2599
+ "grad_norm": 0.05648527294397354,
2600
+ "learning_rate": 1.1430694343715354e-08,
2601
+ "loss": 0.0001,
2602
+ "num_input_tokens_seen": 491936,
2603
+ "step": 1570
2604
+ },
2605
+ {
2606
+ "epoch": 19.6875,
2607
+ "grad_norm": 0.10134085267782211,
2608
+ "learning_rate": 8.041647213256066e-09,
2609
+ "loss": 0.0002,
2610
+ "num_input_tokens_seen": 493504,
2611
+ "step": 1575
2612
+ },
2613
+ {
2614
+ "epoch": 19.75,
2615
+ "grad_norm": 0.022655094042420387,
2616
+ "learning_rate": 5.246593205699424e-09,
2617
+ "loss": 0.0001,
2618
+ "num_input_tokens_seen": 495104,
2619
+ "step": 1580
2620
+ },
2621
+ {
2622
+ "epoch": 19.8125,
2623
+ "grad_norm": 0.009513732977211475,
2624
+ "learning_rate": 3.0458649045211897e-09,
2625
+ "loss": 0.0002,
2626
+ "num_input_tokens_seen": 496672,
2627
+ "step": 1585
2628
+ },
2629
+ {
2630
+ "epoch": 19.875,
2631
+ "grad_norm": 0.013120757415890694,
2632
+ "learning_rate": 1.4397241743813185e-09,
2633
+ "loss": 0.0001,
2634
+ "num_input_tokens_seen": 498240,
2635
+ "step": 1590
2636
+ },
2637
+ {
2638
+ "epoch": 19.9375,
2639
+ "grad_norm": 0.06909758597612381,
2640
+ "learning_rate": 4.283621299649987e-10,
2641
+ "loss": 0.0002,
2642
+ "num_input_tokens_seen": 499808,
2643
+ "step": 1595
2644
+ },
2645
+ {
2646
+ "epoch": 20.0,
2647
+ "grad_norm": 0.05095551162958145,
2648
+ "learning_rate": 1.189911324084303e-11,
2649
+ "loss": 0.0003,
2650
+ "num_input_tokens_seen": 501440,
2651
+ "step": 1600
2652
+ },
2653
+ {
2654
+ "epoch": 20.0,
2655
+ "eval_loss": 1.093362808227539,
2656
+ "eval_runtime": 0.919,
2657
+ "eval_samples_per_second": 87.054,
2658
+ "eval_steps_per_second": 21.763,
2659
+ "num_input_tokens_seen": 501440,
2660
+ "step": 1600
2661
+ },
2662
+ {
2663
+ "epoch": 20.0,
2664
+ "num_input_tokens_seen": 501440,
2665
+ "step": 1600,
2666
+ "total_flos": 2.257961656516608e+16,
2667
+ "train_loss": 0.2722132059369324,
2668
+ "train_runtime": 168.4339,
2669
+ "train_samples_per_second": 37.997,
2670
+ "train_steps_per_second": 9.499
2671
+ }
2672
+ ],
2673
+ "logging_steps": 5,
2674
+ "max_steps": 1600,
2675
+ "num_input_tokens_seen": 501440,
2676
+ "num_train_epochs": 20,
2677
+ "save_steps": 160,
2678
+ "stateful_callbacks": {
2679
+ "TrainerControl": {
2680
+ "args": {
2681
+ "should_epoch_stop": false,
2682
+ "should_evaluate": false,
2683
+ "should_log": false,
2684
+ "should_save": true,
2685
+ "should_training_stop": true
2686
+ },
2687
+ "attributes": {}
2688
+ }
2689
+ },
2690
+ "total_flos": 2.257961656516608e+16,
2691
+ "train_batch_size": 4,
2692
+ "trial_name": null,
2693
+ "trial_params": null
2694
+ }
training_eval_loss.png ADDED
training_loss.png ADDED