hellosindh committed on
Commit 5308429 · verified · 1 Parent(s): a8234de

Upload folder using huggingface_hub
README.md CHANGED
@@ -11,7 +11,14 @@ tags:
 
 # Sindhi-BERT-base
 
-The first BERT-style language model trained from scratch on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.
+A BERT-style model trained from scratch on Sindhi text.
+
+## Training History
+
+| Session | Data | Epochs | Perplexity |
+|---|---|---|---|
+| Session 1 | 500K lines | 5 | 78.10 |
+| Session 2 | 2.1M lines | 3 | 41.62 |
 
 ## Model Details
 
@@ -19,62 +26,14 @@ The first BERT-style language model trained from scratch on Sindhi text, using a
 |---|---|
 | Architecture | RoBERTa-base |
 | Vocabulary | 32,000 tokens (pure Sindhi BPE) |
-| Hidden size | 768 |
-| Layers | 12 |
-| Attention heads | 12 |
-| Max length | 512 tokens |
 | Parameters | ~125M |
 | Language | Sindhi (sd) |
 | License | MIT |
 
-## Training Details
-
-| Detail | Value |
-|---|---|
-| Training data | 500K Sindhi sentences |
-| Full corpus size | 447 MB clean Sindhi text |
-| Epochs | 5 |
-| Batch size | 256 (effective) |
-| Learning rate | 1e-4 |
-| Hardware | A100 GPU |
-| Training time | 301.7 minutes |
-| Final eval loss | 4.358 |
-| Final perplexity | 78.10 |
-
-## Tokenizer
-
-Trained a custom Sindhi BPE tokenizer with 32,000 vocabulary size built specifically for Sindhi script. Every token is a real Sindhi word or subword — unlike multilingual models like mBERT or XLM-R which give Sindhi very limited vocabulary coverage.
-
-Each Sindhi word stays as ONE whole token:
-
-Input : سنڌي ٻولي دنيا جي قديم ٻولين مان هڪ آهي
-
-Tokens : ['سنڌي', 'ٻولي', 'دنيا', 'جي', 'قديم', 'ٻولين', 'مان', 'هڪ', 'آهي']
-
-Count : 9 words = 9 tokens
-
-## Fill-Mask Results
-
-Tested on 10 Sindhi sentences after 5 epochs of training:
-
-| Input | Top Prediction | Score | Quality |
-|---|---|---|---|
-| سنڌي ___ دنيا جي قديم ٻولين | ٻولي (language) | 15.47% | Perfect |
-| شاهه لطيف سنڌي ___ جو وڏو شاعر | ادب (literature) | 16.48% | Perfect |
-| استاد شاگردن کي ___ سيکاري | تعليم (education) | 5.58% | Perfect |
-| دنيا ___ گھڻي مصروف آھي | ۾ (in) | 27.98% | Correct |
-| سنڌ جي ___ ڏاڍي پراڻي آهي | تاريخ (history) | Top 2 | Good |
-| ڪراچي سنڌ جو سڀ کان وڏو ___ آهي | شهر (city) | Top 3 | Close |
-| ٻار ___ ۾ پڙهن ٿا | گهر (home) | Top 2 | Close |
-
-Overall: 50% top-1 accuracy after 5 epochs on 500K sentences.
-Results improve significantly with more training.
-
-## Citation
-
-If you use this model please cite:
-
-sindhibert2026,
-title = Sindhi-BERT: A Sindhi Language Model Trained From Scratch,
-year = 2026,
-url = https://huggingface.co/hellosindh/sindhi-bert-base
+## Roadmap
+
+- [x] Session 1 — 500K lines, 5 epochs
+- [x] Session 2 — 2.1M lines, 3 more epochs
+- [ ] Session 3 — more epochs
+- [ ] Spell checker fine-tuning
+- [ ] Next word prediction
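The Perplexity column in the new Training History table is just the exponential of the masked-LM eval loss; a minimal sketch checking this against the session-1 figures (eval loss 4.358, perplexity 78.10, both taken from this diff):

```python
import math

def perplexity(eval_loss: float) -> float:
    """Masked-LM perplexity is exp of the mean cross-entropy eval loss."""
    return math.exp(eval_loss)

# Session 1 reported a final eval loss of 4.358.
print(round(perplexity(4.358), 2))  # prints 78.1
```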
checkpoint-13992/config.json ADDED
@@ -0,0 +1,28 @@
+{
+  "add_cross_attention": false,
+  "architectures": [
+    "RobertaForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "is_decoder": false,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.0.0",
+  "type_vocab_size": 1,
+  "use_cache": false,
+  "vocab_size": 32001
+}
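Two derived facts from this config can be checked with a short sketch (the JSON literal below copies a subset of the fields from the file above): the per-head attention dimension, and RoBERTa's convention of reserving two position ids, which is why `max_position_embeddings` of 514 corresponds to a 512-token context window.

```python
import json

# Subset of checkpoint-13992/config.json, copied from the diff above.
config = json.loads("""
{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "max_position_embeddings": 514,
  "vocab_size": 32001
}
""")

# Dimension each self-attention head works in.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# RoBERTa offsets position ids by 2 (padding convention), so 514
# position embeddings give a usable 512-token context window.
usable_context = config["max_position_embeddings"] - 2

print(head_dim, usable_context)  # prints 64 512
```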
checkpoint-13992/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3f30d3263e4f5945a574c395e67860f77b1c2ad46656375ee05131edc7f9af6e
+size 442633860
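The weight files in this commit are stored as Git LFS pointer stubs like the one above (a version line plus `key value` fields); a minimal parser, with the pointer text copied from this diff:

```python
# A Git LFS pointer stub, copied verbatim from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3f30d3263e4f5945a574c395e67860f77b1c2ad46656375ee05131edc7f9af6e
size 442633860
"""

# Each line is "key value"; split on the first space only.
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)

print(algo, int(fields["size"]))  # prints sha256 442633860
```

At 4 bytes per float32 weight, 442,633,860 bytes works out to roughly 110M parameters for this 32K-vocab checkpoint.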
checkpoint-13992/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:317967ab747cc0af5476a61cd478cf7c0a8ebae90d31f4849479a9fb3cf1b04d
+size 885391563
checkpoint-13992/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d211a47b4b3b335c468d44ed76f120d5089780b4eb88aa346a57b160dd0dda15
+size 14645
checkpoint-13992/scaler.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8ecd99389d0b0fa98b257c145b8eae39d68db39de8045d8bf547a8b89832a7a5
+size 1383
checkpoint-13992/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:47503d0942a02a8bf8bef3e54b3e92aca8d6588333c6748da7af2330423c43ae
+size 1465
checkpoint-13992/trainer_state.json ADDED
@@ -0,0 +1,1095 @@
+{
+  "best_global_step": 13992,
+  "best_metric": 3.7217817306518555,
+  "best_model_checkpoint": "sindhibert_session2/checkpoint-13992",
+  "epoch": 2.748397868735728,
+  "eval_steps": 1272,
+  "global_step": 13992,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.019642988680727773,
+      "grad_norm": 16.122173309326172,
+      "learning_rate": 9.900000000000002e-06,
+      "loss": 36.5051904296875,
+      "step": 100
+    },
+    {
+      "epoch": 0.039285977361455546,
+      "grad_norm": 18.11626434326172,
+      "learning_rate": 1.9900000000000003e-05,
+      "loss": 36.368505859375,
+      "step": 200
+    },
+    {
+      "epoch": 0.05892896604218332,
+      "grad_norm": 18.927509307861328,
+      "learning_rate": 2.9900000000000002e-05,
+      "loss": 36.45832275390625,
+      "step": 300
+    },
+    {
+      "epoch": 0.07857195472291109,
+      "grad_norm": 18.480167388916016,
+      "learning_rate": 3.99e-05,
+      "loss": 36.31460693359375,
+      "step": 400
+    },
+    {
+      "epoch": 0.09821494340363886,
+      "grad_norm": 19.010652542114258,
+      "learning_rate": 4.99e-05,
+      "loss": 36.24136962890625,
+      "step": 500
+    },
+    {
+      "epoch": 0.11785793208436664,
+      "grad_norm": 19.11362075805664,
+      "learning_rate": 4.966492926284438e-05,
+      "loss": 36.54402587890625,
+      "step": 600
+    },
+    {
+      "epoch": 0.1375009207650944,
+      "grad_norm": 19.41926383972168,
+      "learning_rate": 4.93264739727882e-05,
+      "loss": 36.28423095703125,
+      "step": 700
+    },
+    {
+      "epoch": 0.15714390944582218,
+      "grad_norm": 20.20358657836914,
+      "learning_rate": 4.898801868273201e-05,
+      "loss": 36.2180615234375,
+      "step": 800
+    },
+    {
+      "epoch": 0.17678689812654996,
+      "grad_norm": 18.598957061767578,
+      "learning_rate": 4.864956339267583e-05,
+      "loss": 36.1883984375,
+      "step": 900
+    },
+    {
+      "epoch": 0.19642988680727771,
+      "grad_norm": 18.320087432861328,
+      "learning_rate": 4.831110810261965e-05,
+      "loss": 36.0933837890625,
+      "step": 1000
+    },
+    {
+      "epoch": 0.2160728754880055,
+      "grad_norm": 21.353364944458008,
+      "learning_rate": 4.797265281256346e-05,
+      "loss": 35.95175048828125,
+      "step": 1100
+    },
+    {
+      "epoch": 0.23571586416873327,
+      "grad_norm": 19.709365844726562,
+      "learning_rate": 4.763419752250728e-05,
+      "loss": 35.99169921875,
+      "step": 1200
+    },
+    {
+      "epoch": 0.24985881601885726,
+      "eval_loss": 4.386976718902588,
+      "eval_runtime": 41.9212,
+      "eval_samples_per_second": 477.086,
+      "eval_steps_per_second": 14.909,
+      "step": 1272
+    },
+    {
+      "epoch": 0.255358852849461,
+      "grad_norm": 17.78667449951172,
+      "learning_rate": 4.729574223245109e-05,
+      "loss": 35.86061279296875,
+      "step": 1300
+    },
+    {
+      "epoch": 0.2750018415301888,
+      "grad_norm": 21.036474227905273,
+      "learning_rate": 4.695728694239491e-05,
+      "loss": 35.78255859375,
+      "step": 1400
+    },
+    {
+      "epoch": 0.2946448302109166,
+      "grad_norm": 19.002052307128906,
+      "learning_rate": 4.661883165233873e-05,
+      "loss": 35.5993408203125,
+      "step": 1500
+    },
+    {
+      "epoch": 0.31428781889164437,
+      "grad_norm": 19.828800201416016,
+      "learning_rate": 4.6280376362282543e-05,
+      "loss": 35.70918701171875,
+      "step": 1600
+    },
+    {
+      "epoch": 0.33393080757237215,
+      "grad_norm": 17.12042236328125,
+      "learning_rate": 4.594192107222636e-05,
+      "loss": 35.58583251953125,
+      "step": 1700
+    },
+    {
+      "epoch": 0.3535737962530999,
+      "grad_norm": 19.869712829589844,
+      "learning_rate": 4.560346578217018e-05,
+      "loss": 35.3600537109375,
+      "step": 1800
+    },
+    {
+      "epoch": 0.3732167849338277,
+      "grad_norm": 18.914470672607422,
+      "learning_rate": 4.5265010492113994e-05,
+      "loss": 35.34132080078125,
+      "step": 1900
+    },
+    {
+      "epoch": 0.39285977361455543,
+      "grad_norm": 18.888940811157227,
+      "learning_rate": 4.4926555202057814e-05,
+      "loss": 35.33590576171875,
+      "step": 2000
+    },
+    {
+      "epoch": 0.4125027622952832,
+      "grad_norm": 20.27227783203125,
+      "learning_rate": 4.4588099912001626e-05,
+      "loss": 35.22805908203125,
+      "step": 2100
+    },
+    {
+      "epoch": 0.432145750976011,
+      "grad_norm": 15.550501823425293,
+      "learning_rate": 4.424964462194544e-05,
+      "loss": 35.256953125,
+      "step": 2200
+    },
+    {
+      "epoch": 0.45178873965673877,
+      "grad_norm": 17.671451568603516,
+      "learning_rate": 4.391118933188926e-05,
+      "loss": 35.056826171875,
+      "step": 2300
+    },
+    {
+      "epoch": 0.47143172833746655,
+      "grad_norm": 18.74838638305664,
+      "learning_rate": 4.357273404183308e-05,
+      "loss": 34.9271826171875,
+      "step": 2400
+    },
+    {
+      "epoch": 0.49107471701819433,
+      "grad_norm": 20.912931442260742,
+      "learning_rate": 4.323427875177689e-05,
+      "loss": 34.793916015625,
+      "step": 2500
+    },
+    {
+      "epoch": 0.4997176320377145,
+      "eval_loss": 4.270185947418213,
+      "eval_runtime": 41.8695,
+      "eval_samples_per_second": 477.675,
+      "eval_steps_per_second": 14.927,
+      "step": 2544
+    },
+    {
+      "epoch": 0.510717705698922,
+      "grad_norm": 18.11321258544922,
+      "learning_rate": 4.289582346172071e-05,
+      "loss": 34.97314453125,
+      "step": 2600
+    },
+    {
+      "epoch": 0.5303606943796498,
+      "grad_norm": 19.38896942138672,
+      "learning_rate": 4.255736817166453e-05,
+      "loss": 34.873974609375,
+      "step": 2700
+    },
+    {
+      "epoch": 0.5500036830603776,
+      "grad_norm": 19.479278564453125,
+      "learning_rate": 4.221891288160834e-05,
+      "loss": 34.713251953125,
+      "step": 2800
+    },
+    {
+      "epoch": 0.5696466717411054,
+      "grad_norm": 19.849210739135742,
+      "learning_rate": 4.188045759155216e-05,
+      "loss": 34.66454833984375,
+      "step": 2900
+    },
+    {
+      "epoch": 0.5892896604218332,
+      "grad_norm": 19.59630012512207,
+      "learning_rate": 4.154200230149597e-05,
+      "loss": 34.546259765625,
+      "step": 3000
+    },
+    {
+      "epoch": 0.608932649102561,
+      "grad_norm": 18.32138442993164,
+      "learning_rate": 4.120354701143979e-05,
+      "loss": 34.460576171875,
+      "step": 3100
+    },
+    {
+      "epoch": 0.6285756377832887,
+      "grad_norm": 17.825464248657227,
+      "learning_rate": 4.086509172138361e-05,
+      "loss": 34.3186962890625,
+      "step": 3200
+    },
+    {
+      "epoch": 0.6482186264640165,
+      "grad_norm": 19.180105209350586,
+      "learning_rate": 4.052663643132742e-05,
+      "loss": 34.40132080078125,
+      "step": 3300
+    },
+    {
+      "epoch": 0.6678616151447443,
+      "grad_norm": 18.498641967773438,
+      "learning_rate": 4.018818114127124e-05,
+      "loss": 34.2248828125,
+      "step": 3400
+    },
+    {
+      "epoch": 0.6875046038254721,
+      "grad_norm": 19.08323097229004,
+      "learning_rate": 3.984972585121506e-05,
+      "loss": 33.97822265625,
+      "step": 3500
+    },
+    {
+      "epoch": 0.7071475925061999,
+      "grad_norm": 20.13410758972168,
+      "learning_rate": 3.9511270561158874e-05,
+      "loss": 34.1056494140625,
+      "step": 3600
+    },
+    {
+      "epoch": 0.7267905811869276,
+      "grad_norm": 18.82459259033203,
+      "learning_rate": 3.917281527110269e-05,
+      "loss": 34.2245947265625,
+      "step": 3700
+    },
+    {
+      "epoch": 0.7464335698676554,
+      "grad_norm": 17.352100372314453,
+      "learning_rate": 3.8834359981046505e-05,
+      "loss": 34.04242919921875,
+      "step": 3800
+    },
+    {
+      "epoch": 0.7495764480565718,
+      "eval_loss": 4.178581237792969,
+      "eval_runtime": 42.0642,
+      "eval_samples_per_second": 475.463,
+      "eval_steps_per_second": 14.858,
+      "step": 3816
+    },
+    {
+      "epoch": 0.7660765585483832,
+      "grad_norm": 19.103435516357422,
+      "learning_rate": 3.849590469099032e-05,
+      "loss": 34.01751708984375,
+      "step": 3900
+    },
+    {
+      "epoch": 0.7857195472291109,
+      "grad_norm": 18.045913696289062,
+      "learning_rate": 3.815744940093414e-05,
+      "loss": 33.90669921875,
+      "step": 4000
+    },
+    {
+      "epoch": 0.8053625359098386,
+      "grad_norm": 18.76168441772461,
+      "learning_rate": 3.7818994110877956e-05,
+      "loss": 33.872587890625,
+      "step": 4100
+    },
+    {
+      "epoch": 0.8250055245905664,
+      "grad_norm": 16.547574996948242,
+      "learning_rate": 3.748053882082177e-05,
+      "loss": 33.8643359375,
+      "step": 4200
+    },
+    {
+      "epoch": 0.8446485132712942,
+      "grad_norm": 18.636455535888672,
+      "learning_rate": 3.714208353076559e-05,
+      "loss": 33.7685595703125,
+      "step": 4300
+    },
+    {
+      "epoch": 0.864291501952022,
+      "grad_norm": 18.742900848388672,
+      "learning_rate": 3.68036282407094e-05,
+      "loss": 33.53510009765625,
+      "step": 4400
+    },
+    {
+      "epoch": 0.8839344906327498,
+      "grad_norm": 20.976703643798828,
+      "learning_rate": 3.646517295065322e-05,
+      "loss": 33.42730224609375,
+      "step": 4500
+    },
+    {
+      "epoch": 0.9035774793134775,
+      "grad_norm": 18.552316665649414,
+      "learning_rate": 3.612671766059704e-05,
+      "loss": 33.53127197265625,
+      "step": 4600
+    },
+    {
+      "epoch": 0.9232204679942053,
+      "grad_norm": 21.41478157043457,
+      "learning_rate": 3.578826237054085e-05,
+      "loss": 33.41383544921875,
+      "step": 4700
+    },
+    {
+      "epoch": 0.9428634566749331,
+      "grad_norm": 19.785966873168945,
+      "learning_rate": 3.544980708048467e-05,
+      "loss": 33.38103271484375,
+      "step": 4800
+    },
+    {
+      "epoch": 0.9625064453556609,
+      "grad_norm": 17.69455337524414,
+      "learning_rate": 3.511135179042849e-05,
+      "loss": 33.4733447265625,
+      "step": 4900
+    },
+    {
+      "epoch": 0.9821494340363887,
+      "grad_norm": 20.246673583984375,
+      "learning_rate": 3.47728965003723e-05,
+      "loss": 33.4089208984375,
+      "step": 5000
+    },
+    {
+      "epoch": 0.999435264075429,
+      "eval_loss": 4.0612263679504395,
+      "eval_runtime": 42.4302,
+      "eval_samples_per_second": 471.362,
+      "eval_steps_per_second": 14.73,
+      "step": 5088
+    },
+    {
+      "epoch": 1.0017678689812655,
+      "grad_norm": 19.691932678222656,
+      "learning_rate": 3.443444121031612e-05,
+      "loss": 33.1838037109375,
+      "step": 5100
+    },
+    {
+      "epoch": 1.0214108576619934,
+      "grad_norm": 18.13388442993164,
+      "learning_rate": 3.409598592025994e-05,
+      "loss": 33.21130859375,
+      "step": 5200
+    },
+    {
+      "epoch": 1.041053846342721,
+      "grad_norm": 18.41756248474121,
+      "learning_rate": 3.375753063020375e-05,
+      "loss": 33.07806396484375,
+      "step": 5300
+    },
+    {
+      "epoch": 1.060696835023449,
+      "grad_norm": 18.85491943359375,
+      "learning_rate": 3.3419075340147566e-05,
+      "loss": 32.95417724609375,
+      "step": 5400
+    },
+    {
+      "epoch": 1.0803398237041766,
+      "grad_norm": 20.03909683227539,
+      "learning_rate": 3.3080620050091385e-05,
+      "loss": 32.93142822265625,
+      "step": 5500
+    },
+    {
+      "epoch": 1.0999828123849045,
+      "grad_norm": 19.49604606628418,
+      "learning_rate": 3.27421647600352e-05,
+      "loss": 32.9296630859375,
+      "step": 5600
+    },
+    {
+      "epoch": 1.1196258010656321,
+      "grad_norm": 21.259004592895508,
+      "learning_rate": 3.2403709469979017e-05,
+      "loss": 32.85189453125,
+      "step": 5700
+    },
+    {
+      "epoch": 1.1392687897463598,
+      "grad_norm": 19.597267150878906,
+      "learning_rate": 3.206525417992283e-05,
+      "loss": 32.82923583984375,
+      "step": 5800
+    },
+    {
+      "epoch": 1.1589117784270877,
+      "grad_norm": 20.224699020385742,
+      "learning_rate": 3.172679888986665e-05,
+      "loss": 32.7767626953125,
+      "step": 5900
+    },
+    {
+      "epoch": 1.1785547671078154,
+      "grad_norm": 18.452495574951172,
+      "learning_rate": 3.138834359981047e-05,
+      "loss": 32.6592626953125,
+      "step": 6000
+    },
+    {
+      "epoch": 1.1981977557885433,
+      "grad_norm": 19.971717834472656,
+      "learning_rate": 3.104988830975428e-05,
+      "loss": 32.80323974609375,
+      "step": 6100
+    },
+    {
+      "epoch": 1.217840744469271,
+      "grad_norm": 17.584882736206055,
+      "learning_rate": 3.07114330196981e-05,
+      "loss": 32.6551416015625,
+      "step": 6200
+    },
+    {
+      "epoch": 1.2374837331499988,
+      "grad_norm": 19.49502944946289,
+      "learning_rate": 3.0372977729641915e-05,
+      "loss": 32.73489990234375,
+      "step": 6300
+    },
+    {
+      "epoch": 1.2492695263584355,
+      "eval_loss": 3.9626989364624023,
+      "eval_runtime": 41.8921,
+      "eval_samples_per_second": 477.418,
+      "eval_steps_per_second": 14.919,
+      "step": 6360
+    },
+    {
+      "epoch": 1.2571267218307265,
+      "grad_norm": 20.214130401611328,
+      "learning_rate": 3.0034522439585734e-05,
+      "loss": 32.55817138671875,
+      "step": 6400
+    },
+    {
+      "epoch": 1.2767697105114544,
+      "grad_norm": 20.668527603149414,
+      "learning_rate": 2.969606714952955e-05,
+      "loss": 32.53630615234375,
+      "step": 6500
+    },
+    {
+      "epoch": 1.296412699192182,
+      "grad_norm": 18.479408264160156,
+      "learning_rate": 2.9357611859473366e-05,
+      "loss": 32.464462890625,
+      "step": 6600
+    },
+    {
+      "epoch": 1.31605568787291,
+      "grad_norm": 19.027793884277344,
+      "learning_rate": 2.9019156569417182e-05,
+      "loss": 32.48625244140625,
+      "step": 6700
+    },
+    {
+      "epoch": 1.3356986765536376,
+      "grad_norm": 19.871105194091797,
+      "learning_rate": 2.8680701279361e-05,
+      "loss": 32.41595947265625,
+      "step": 6800
+    },
+    {
+      "epoch": 1.3553416652343655,
+      "grad_norm": 19.916994094848633,
+      "learning_rate": 2.8342245989304817e-05,
+      "loss": 32.11419921875,
+      "step": 6900
+    },
+    {
+      "epoch": 1.3749846539150932,
+      "grad_norm": 21.212909698486328,
+      "learning_rate": 2.8003790699248633e-05,
+      "loss": 32.30314453125,
+      "step": 7000
+    },
+    {
+      "epoch": 1.3946276425958208,
+      "grad_norm": 25.216768264770508,
+      "learning_rate": 2.7665335409192445e-05,
+      "loss": 32.19344482421875,
+      "step": 7100
+    },
+    {
+      "epoch": 1.4142706312765487,
+      "grad_norm": 19.619844436645508,
+      "learning_rate": 2.732688011913626e-05,
+      "loss": 32.30953125,
+      "step": 7200
+    },
+    {
+      "epoch": 1.4339136199572766,
+      "grad_norm": 21.061376571655273,
+      "learning_rate": 2.6988424829080077e-05,
+      "loss": 32.3416162109375,
+      "step": 7300
+    },
+    {
+      "epoch": 1.4535566086380043,
+      "grad_norm": 18.674562454223633,
+      "learning_rate": 2.6649969539023896e-05,
+      "loss": 32.2662744140625,
+      "step": 7400
+    },
+    {
+      "epoch": 1.473199597318732,
+      "grad_norm": 18.776655197143555,
+      "learning_rate": 2.6311514248967712e-05,
+      "loss": 32.07302978515625,
+      "step": 7500
+    },
+    {
+      "epoch": 1.4928425859994598,
+      "grad_norm": 19.0480899810791,
+      "learning_rate": 2.5973058958911528e-05,
+      "loss": 32.19434326171875,
+      "step": 7600
+    },
+    {
+      "epoch": 1.4991283423772928,
+      "eval_loss": 3.9200026988983154,
+      "eval_runtime": 42.1732,
+      "eval_samples_per_second": 474.235,
+      "eval_steps_per_second": 14.82,
+      "step": 7632
+    },
+    {
+      "epoch": 1.5124855746801877,
+      "grad_norm": 18.192241668701172,
+      "learning_rate": 2.5634603668855344e-05,
+      "loss": 32.0382470703125,
+      "step": 7700
+    },
+    {
+      "epoch": 1.5321285633609154,
+      "grad_norm": 21.64850425720215,
+      "learning_rate": 2.5296148378799163e-05,
+      "loss": 32.0071484375,
+      "step": 7800
+    },
+    {
+      "epoch": 1.551771552041643,
+      "grad_norm": 21.07256507873535,
+      "learning_rate": 2.495769308874298e-05,
+      "loss": 32.02736328125,
+      "step": 7900
+    },
+    {
+      "epoch": 1.5714145407223707,
+      "grad_norm": 18.811485290527344,
+      "learning_rate": 2.4619237798686794e-05,
+      "loss": 32.18232666015625,
+      "step": 8000
+    },
+    {
+      "epoch": 1.5910575294030986,
+      "grad_norm": 20.226411819458008,
+      "learning_rate": 2.428078250863061e-05,
+      "loss": 31.81751220703125,
+      "step": 8100
+    },
+    {
+      "epoch": 1.6107005180838265,
+      "grad_norm": 21.44918441772461,
+      "learning_rate": 2.394232721857443e-05,
+      "loss": 31.89043212890625,
+      "step": 8200
+    },
+    {
+      "epoch": 1.6303435067645542,
+      "grad_norm": 19.660367965698242,
+      "learning_rate": 2.3603871928518245e-05,
+      "loss": 31.95683349609375,
+      "step": 8300
+    },
+    {
+      "epoch": 1.6499864954452819,
+      "grad_norm": 19.144596099853516,
+      "learning_rate": 2.3265416638462058e-05,
+      "loss": 31.867197265625,
+      "step": 8400
+    },
+    {
+      "epoch": 1.6696294841260098,
+      "grad_norm": 18.604026794433594,
+      "learning_rate": 2.2926961348405877e-05,
+      "loss": 31.78265380859375,
+      "step": 8500
+    },
+    {
+      "epoch": 1.6892724728067376,
+      "grad_norm": 19.978652954101562,
+      "learning_rate": 2.2588506058349693e-05,
+      "loss": 31.79925048828125,
+      "step": 8600
+    },
+    {
+      "epoch": 1.7089154614874653,
+      "grad_norm": 18.18141746520996,
+      "learning_rate": 2.225005076829351e-05,
+      "loss": 31.853974609375,
+      "step": 8700
+    },
+    {
+      "epoch": 1.728558450168193,
+      "grad_norm": 17.99820899963379,
+      "learning_rate": 2.1911595478237325e-05,
+      "loss": 31.75265625,
+      "step": 8800
+    },
+    {
+      "epoch": 1.7482014388489209,
+      "grad_norm": 20.680606842041016,
+      "learning_rate": 2.1573140188181144e-05,
+      "loss": 31.6561328125,
+      "step": 8900
+    },
+    {
+      "epoch": 1.74898715839615,
+      "eval_loss": 3.863434076309204,
+      "eval_runtime": 42.0055,
+      "eval_samples_per_second": 476.128,
+      "eval_steps_per_second": 14.879,
+      "step": 8904
+    },
+    {
+      "epoch": 1.7678444275296488,
+      "grad_norm": 20.50802993774414,
+      "learning_rate": 2.123468489812496e-05,
+      "loss": 31.6072265625,
+      "step": 9000
+    },
+    {
+      "epoch": 1.7874874162103764,
+      "grad_norm": 21.482328414916992,
+      "learning_rate": 2.0896229608068775e-05,
+      "loss": 31.7250537109375,
+      "step": 9100
+    },
+    {
+      "epoch": 1.807130404891104,
+      "grad_norm": 19.20509910583496,
+      "learning_rate": 2.055777431801259e-05,
+      "loss": 31.58796875,
+      "step": 9200
+    },
+    {
+      "epoch": 1.826773393571832,
+      "grad_norm": 21.03694725036621,
+      "learning_rate": 2.0219319027956407e-05,
+      "loss": 31.68398193359375,
+      "step": 9300
+    },
+    {
+      "epoch": 1.8464163822525599,
+      "grad_norm": 18.272459030151367,
+      "learning_rate": 1.9880863737900223e-05,
+      "loss": 31.53086181640625,
+      "step": 9400
+    },
+    {
+      "epoch": 1.8660593709332876,
+      "grad_norm": 19.046916961669922,
+      "learning_rate": 1.9542408447844042e-05,
+      "loss": 31.525322265625,
+      "step": 9500
+    },
+    {
+      "epoch": 1.8857023596140152,
+      "grad_norm": 21.118305206298828,
+      "learning_rate": 1.9203953157787858e-05,
+      "loss": 31.52841552734375,
+      "step": 9600
+    },
+    {
+      "epoch": 1.905345348294743,
+      "grad_norm": 18.861080169677734,
+      "learning_rate": 1.8865497867731674e-05,
+      "loss": 31.4529345703125,
+      "step": 9700
+    },
+    {
+      "epoch": 1.9249883369754708,
+      "grad_norm": 20.4729061126709,
+      "learning_rate": 1.852704257767549e-05,
+      "loss": 31.35305419921875,
+      "step": 9800
+    },
+    {
+      "epoch": 1.9446313256561987,
+      "grad_norm": 17.702392578125,
+      "learning_rate": 1.818858728761931e-05,
+      "loss": 31.5552734375,
+      "step": 9900
+    },
+    {
+      "epoch": 1.9642743143369263,
+      "grad_norm": 21.927942276000977,
+      "learning_rate": 1.785013199756312e-05,
+      "loss": 31.42951416015625,
+      "step": 10000
+    },
+    {
+      "epoch": 1.983917303017654,
+      "grad_norm": 19.895252227783203,
+      "learning_rate": 1.7511676707506937e-05,
+      "loss": 31.416962890625,
+      "step": 10100
+    },
+    {
+      "epoch": 1.998845974415007,
+      "eval_loss": 3.819389820098877,
+      "eval_runtime": 41.9176,
+      "eval_samples_per_second": 477.126,
+      "eval_steps_per_second": 14.91,
+      "step": 10176
+    },
+    {
+      "epoch": 2.003535737962531,
+      "grad_norm": 20.209577560424805,
+      "learning_rate": 1.7173221417450756e-05,
+      "loss": 31.25757568359375,
+      "step": 10200
+    },
+    {
+      "epoch": 2.0231787266432586,
+      "grad_norm": 19.49869155883789,
+      "learning_rate": 1.6834766127394572e-05,
+      "loss": 31.14395263671875,
+      "step": 10300
+    },
+    {
+      "epoch": 2.0428217153239867,
+      "grad_norm": 20.60426139831543,
+      "learning_rate": 1.6496310837338388e-05,
+      "loss": 31.38017333984375,
+      "step": 10400
+    },
+    {
+      "epoch": 2.0624647040047144,
+      "grad_norm": 19.177818298339844,
+      "learning_rate": 1.6157855547282204e-05,
+      "loss": 31.20944091796875,
+      "step": 10500
+    },
+    {
+      "epoch": 2.082107692685442,
+      "grad_norm": 20.949337005615234,
+      "learning_rate": 1.5819400257226023e-05,
+      "loss": 31.2146240234375,
+      "step": 10600
+    },
+    {
+      "epoch": 2.1017506813661697,
+      "grad_norm": 19.25591468811035,
+      "learning_rate": 1.548094496716984e-05,
+      "loss": 31.202607421875,
+      "step": 10700
+    },
+    {
+      "epoch": 2.121393670046898,
+      "grad_norm": 18.960092544555664,
+      "learning_rate": 1.5142489677113653e-05,
+      "loss": 31.14611572265625,
+      "step": 10800
+    },
+    {
+      "epoch": 2.1410366587276255,
+      "grad_norm": 18.479068756103516,
+      "learning_rate": 1.4804034387057469e-05,
+      "loss": 31.26153564453125,
+      "step": 10900
+    },
+    {
+      "epoch": 2.160679647408353,
+      "grad_norm": 21.587387084960938,
+      "learning_rate": 1.4465579097001287e-05,
+      "loss": 31.1222998046875,
+      "step": 11000
+    },
+    {
+      "epoch": 2.180322636089081,
+      "grad_norm": 17.947052001953125,
+      "learning_rate": 1.4127123806945102e-05,
+      "loss": 31.08917236328125,
+      "step": 11100
+    },
+    {
+      "epoch": 2.199965624769809,
+      "grad_norm": 19.169307708740234,
+      "learning_rate": 1.378866851688892e-05,
+      "loss": 31.0661474609375,
+      "step": 11200
+    },
+    {
+      "epoch": 2.2196086134505366,
+      "grad_norm": 16.882522583007812,
+      "learning_rate": 1.3450213226832736e-05,
+      "loss": 30.9886328125,
+      "step": 11300
+    },
+    {
+      "epoch": 2.2392516021312643,
+      "grad_norm": 19.624177932739258,
+      "learning_rate": 1.3111757936776553e-05,
+      "loss": 30.93468505859375,
+      "step": 11400
+    },
+    {
+      "epoch": 2.2486802366980134,
+      "eval_loss": 3.786958694458008,
+      "eval_runtime": 41.8524,
+      "eval_samples_per_second": 477.869,
+      "eval_steps_per_second": 14.933,
+      "step": 11448
+    },
+    {
+      "epoch": 2.258894590811992,
+      "grad_norm": 20.477542877197266,
+      "learning_rate": 1.2773302646720369e-05,
+      "loss": 31.00721923828125,
+      "step": 11500
+    },
+    {
+      "epoch": 2.2785375794927196,
+      "grad_norm": 19.928098678588867,
+      "learning_rate": 1.2434847356664185e-05,
+      "loss": 30.95269775390625,
+      "step": 11600
+    },
+    {
+      "epoch": 2.2981805681734477,
+      "grad_norm": 19.002788543701172,
+      "learning_rate": 1.2096392066608003e-05,
+      "loss": 30.908701171875,
+      "step": 11700
+    },
+    {
+      "epoch": 2.3178235568541754,
+      "grad_norm": 20.50242805480957,
+      "learning_rate": 1.1757936776551818e-05,
+      "loss": 30.96546142578125,
+      "step": 11800
+    },
+    {
+      "epoch": 2.337466545534903,
+      "grad_norm": 20.48063850402832,
+      "learning_rate": 1.1419481486495634e-05,
+      "loss": 30.9563623046875,
+      "step": 11900
+    },
+    {
+      "epoch": 2.3571095342156307,
+      "grad_norm": 19.522266387939453,
+      "learning_rate": 1.108102619643945e-05,
+      "loss": 30.85960205078125,
+      "step": 12000
+    },
+    {
+      "epoch": 2.376752522896359,
+      "grad_norm": 21.33004379272461,
+      "learning_rate": 1.0742570906383268e-05,
+      "loss": 30.829111328125,
+      "step": 12100
+    },
+    {
+      "epoch": 2.3963955115770865,
+      "grad_norm": 20.311534881591797,
+      "learning_rate": 1.0404115616327083e-05,
+      "loss": 30.93859619140625,
+      "step": 12200
+    },
+    {
+      "epoch": 2.416038500257814,
+      "grad_norm": 20.128795623779297,
+      "learning_rate": 1.00656603262709e-05,
+      "loss": 30.750810546875,
+      "step": 12300
+    },
+    {
+      "epoch": 2.435681488938542,
+      "grad_norm": 22.28921890258789,
+      "learning_rate": 9.727205036214717e-06,
+      "loss": 30.714560546875,
+      "step": 12400
+    },
+    {
+      "epoch": 2.4553244776192695,
+      "grad_norm": 24.13454818725586,
+      "learning_rate": 9.388749746158533e-06,
+      "loss": 30.87623779296875,
+      "step": 12500
+    },
+    {
+      "epoch": 2.4749674662999976,
+      "grad_norm": 20.58381462097168,
+      "learning_rate": 9.05029445610235e-06,
+      "loss": 30.60492431640625,
+      "step": 12600
+    },
+    {
+      "epoch": 2.4946104549807253,
+      "grad_norm": 20.045475006103516,
+      "learning_rate": 8.711839166046164e-06,
+      "loss": 30.8008154296875,
+      "step": 12700
+    },
+    {
+      "epoch": 2.498539052716871,
+      "eval_loss": 3.7586019039154053,
+      "eval_runtime": 42.0034,
+      "eval_samples_per_second": 476.152,
+      "eval_steps_per_second": 14.88,
+      "step": 12720
+    },
+    {
+      "epoch": 2.514253443661453
+ "epoch": 2.514253443661453,
983
+ "grad_norm": 19.53034210205078,
984
+ "learning_rate": 8.373383875989982e-06,
985
+ "loss": 30.7534326171875,
986
+ "step": 12800
987
+ },
988
+ {
989
+ "epoch": 2.533896432342181,
990
+ "grad_norm": 20.510520935058594,
991
+ "learning_rate": 8.0349285859338e-06,
992
+ "loss": 30.7755859375,
993
+ "step": 12900
994
+ },
995
+ {
996
+ "epoch": 2.5535394210229088,
997
+ "grad_norm": 20.725147247314453,
998
+ "learning_rate": 7.696473295877615e-06,
999
+ "loss": 30.9005029296875,
1000
+ "step": 13000
1001
+ },
1002
+ {
1003
+ "epoch": 2.5731824097036364,
1004
+ "grad_norm": 20.11240577697754,
1005
+ "learning_rate": 7.358018005821431e-06,
1006
+ "loss": 30.81609130859375,
1007
+ "step": 13100
1008
+ },
1009
+ {
1010
+ "epoch": 2.592825398384364,
1011
+ "grad_norm": 19.01041603088379,
1012
+ "learning_rate": 7.019562715765248e-06,
1013
+ "loss": 30.70421142578125,
1014
+ "step": 13200
1015
+ },
1016
+ {
1017
+ "epoch": 2.6124683870650918,
1018
+ "grad_norm": 20.232532501220703,
1019
+ "learning_rate": 6.681107425709064e-06,
1020
+ "loss": 30.7105322265625,
1021
+ "step": 13300
1022
+ },
1023
+ {
1024
+ "epoch": 2.63211137574582,
1025
+ "grad_norm": 21.33913803100586,
1026
+ "learning_rate": 6.342652135652881e-06,
1027
+ "loss": 30.76764892578125,
1028
+ "step": 13400
1029
+ },
1030
+ {
1031
+ "epoch": 2.6517543644265475,
1032
+ "grad_norm": 19.718833923339844,
1033
+ "learning_rate": 6.004196845596697e-06,
1034
+ "loss": 30.75705078125,
1035
+ "step": 13500
1036
+ },
1037
+ {
1038
+ "epoch": 2.671397353107275,
1039
+ "grad_norm": 20.983705520629883,
1040
+ "learning_rate": 5.665741555540514e-06,
1041
+ "loss": 30.66492431640625,
1042
+ "step": 13600
1043
+ },
1044
+ {
1045
+ "epoch": 2.691040341788003,
1046
+ "grad_norm": 18.726970672607422,
1047
+ "learning_rate": 5.3272862654843295e-06,
1048
+ "loss": 30.72262451171875,
1049
+ "step": 13700
1050
+ },
1051
+ {
1052
+ "epoch": 2.710683330468731,
1053
+ "grad_norm": 21.197751998901367,
1054
+ "learning_rate": 4.988830975428146e-06,
1055
+ "loss": 30.6771728515625,
1056
+ "step": 13800
1057
+ },
1058
+ {
1059
+ "epoch": 2.7303263191494587,
1060
+ "grad_norm": 21.318998336791992,
1061
+ "learning_rate": 4.650375685371963e-06,
1062
+ "loss": 30.52608642578125,
1063
+ "step": 13900
1064
+ },
1065
+ {
1066
+ "epoch": 2.748397868735728,
1067
+ "eval_loss": 3.7217817306518555,
1068
+ "eval_runtime": 41.8977,
1069
+ "eval_samples_per_second": 477.353,
1070
+ "eval_steps_per_second": 14.917,
1071
+ "step": 13992
1072
+ }
1073
+ ],
1074
+ "logging_steps": 100,
1075
+ "max_steps": 15273,
1076
+ "num_input_tokens_seen": 0,
1077
+ "num_train_epochs": 3,
1078
+ "save_steps": 1272,
1079
+ "stateful_callbacks": {
1080
+ "TrainerControl": {
1081
+ "args": {
1082
+ "should_epoch_stop": false,
1083
+ "should_evaluate": false,
1084
+ "should_log": false,
1085
+ "should_save": true,
1086
+ "should_training_stop": false
1087
+ },
1088
+ "attributes": {}
1089
+ }
1090
+ },
1091
+ "total_flos": 9.427785384950231e+17,
1092
+ "train_batch_size": 32,
1093
+ "trial_name": null,
1094
+ "trial_params": null
1095
+ }
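The `eval_loss` values in the trainer state above convert to perplexity via `exp(loss)`. A minimal stdlib sketch over the three eval records logged above (the values are copied verbatim from the log):

```python
import math

# (step, eval_loss) pairs copied from the trainer-state log above
eval_log = [
    (11448, 3.786958694458008),
    (12720, 3.7586019039154053),
    (13992, 3.7217817306518555),  # best_metric / best checkpoint
]

for step, loss in eval_log:
    # perplexity is just the exponential of the cross-entropy eval loss
    print(f"step {step}: eval_loss={loss:.4f}  perplexity={math.exp(loss):.2f}")
```

This is why checkpoint-13992 is recorded as `best_model_checkpoint`: it has the lowest eval loss, hence the lowest perplexity, of the evaluations logged here.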
checkpoint-13992/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eaec524752280134c2e0387d7f5f1e2cc6d34eaa3f289327a642e9b1d7d2b9c9
+ size 5137
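The three-line block above is a Git LFS pointer file, not the binary itself: the repository stores only the `version`/`oid`/`size` fields and the actual bytes live in LFS storage. A small sketch parsing that pointer format (the sample text mirrors the `training_args.bin` pointer above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its space-separated key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Sample pointer, copied from the training_args.bin entry above
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:eaec524752280134c2e0387d7f5f1e2cc6d34eaa3f289327a642e9b1d7d2b9c9\n"
    "size 5137\n"
)
info = parse_lfs_pointer(pointer)
print(info["oid"], info["size"])
```

The `oid` is the SHA-256 of the stored object and `size` is its byte length, which is how the diffs below can show multi-hundred-megabyte files in three lines.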
checkpoint-15273/config.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "add_cross_attention": false,
+ "architectures": [
+ "RobertaForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 1,
+ "classifier_dropout": null,
+ "dtype": "float32",
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "is_decoder": false,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 514,
+ "model_type": "roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 0,
+ "tie_word_embeddings": true,
+ "transformers_version": "5.0.0",
+ "type_vocab_size": 1,
+ "use_cache": false,
+ "vocab_size": 32001
+ }
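The config above fully determines the model's size. A rough stdlib sketch estimating the weight-matrix parameter count from those fields (this is the standard BERT/RoBERTa breakdown counting only the large matrices; biases, LayerNorms, and the MLM head transform add roughly another 1%):

```python
# Field values copied from checkpoint-15273/config.json above
cfg = {"vocab_size": 32001, "hidden_size": 768, "num_hidden_layers": 12,
       "intermediate_size": 3072, "max_position_embeddings": 514,
       "type_vocab_size": 1}

h, ff = cfg["hidden_size"], cfg["intermediate_size"]

# Embeddings: word + position + token-type lookup tables
embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
              + cfg["type_vocab_size"]) * h

# One encoder layer: Q/K/V/output projections plus the two FFN matrices
per_layer = 4 * h * h + 2 * h * ff
encoder = cfg["num_hidden_layers"] * per_layer

total = embeddings + encoder
print(f"~{total / 1e6:.1f}M weight-matrix parameters")
```

Note that `tie_word_embeddings: true` means the MLM output projection reuses the word-embedding table, so it contributes no extra matrix to this count.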
checkpoint-15273/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e84d91ed965ee8a4304e29704132a4eb31a020da83620feeb85893d674be3ef
+ size 442633860
checkpoint-15273/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:936ee266876e662dcf56a141dcffd2bdef49fc3fb413397030a5008dea1a9577
+ size 885391563
checkpoint-15273/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0c587c86f3d5a4c4b98c761a28cc64fdb870f6214dbf5835d1fcb43dedf56307
+ size 14645
checkpoint-15273/scaler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dbf8b230b03e6bd05860655271ad5a73c7151e2784bd60ccf16d5bcc0db59048
+ size 1383
checkpoint-15273/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f78a79bb217a80efbb6f1c2b98c1488f65ab61871236f43f21776e7b9d42b60
+ size 1465
checkpoint-15273/trainer_state.json ADDED
@@ -0,0 +1,1194 @@
+ {
+ "best_global_step": 13992,
+ "best_metric": 3.7217817306518555,
+ "best_model_checkpoint": "sindhibert_session2/checkpoint-13992",
+ "epoch": 3.0,
+ "eval_steps": 1272,
+ "global_step": 15273,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.019642988680727773,
+ "grad_norm": 16.122173309326172,
+ "learning_rate": 9.900000000000002e-06,
+ "loss": 36.5051904296875,
+ "step": 100
+ },
+ {
+ "epoch": 0.039285977361455546,
+ "grad_norm": 18.11626434326172,
+ "learning_rate": 1.9900000000000003e-05,
+ "loss": 36.368505859375,
+ "step": 200
+ },
+ {
+ "epoch": 0.05892896604218332,
+ "grad_norm": 18.927509307861328,
+ "learning_rate": 2.9900000000000002e-05,
+ "loss": 36.45832275390625,
+ "step": 300
+ },
+ {
+ "epoch": 0.07857195472291109,
+ "grad_norm": 18.480167388916016,
+ "learning_rate": 3.99e-05,
+ "loss": 36.31460693359375,
+ "step": 400
+ },
+ {
+ "epoch": 0.09821494340363886,
+ "grad_norm": 19.010652542114258,
+ "learning_rate": 4.99e-05,
+ "loss": 36.24136962890625,
+ "step": 500
+ },
+ {
+ "epoch": 0.11785793208436664,
+ "grad_norm": 19.11362075805664,
+ "learning_rate": 4.966492926284438e-05,
+ "loss": 36.54402587890625,
+ "step": 600
+ },
+ {
+ "epoch": 0.1375009207650944,
+ "grad_norm": 19.41926383972168,
+ "learning_rate": 4.93264739727882e-05,
+ "loss": 36.28423095703125,
+ "step": 700
+ },
+ {
+ "epoch": 0.15714390944582218,
+ "grad_norm": 20.20358657836914,
+ "learning_rate": 4.898801868273201e-05,
+ "loss": 36.2180615234375,
+ "step": 800
+ },
+ {
+ "epoch": 0.17678689812654996,
+ "grad_norm": 18.598957061767578,
+ "learning_rate": 4.864956339267583e-05,
+ "loss": 36.1883984375,
+ "step": 900
+ },
+ {
+ "epoch": 0.19642988680727771,
+ "grad_norm": 18.320087432861328,
+ "learning_rate": 4.831110810261965e-05,
+ "loss": 36.0933837890625,
+ "step": 1000
+ },
+ {
+ "epoch": 0.2160728754880055,
+ "grad_norm": 21.353364944458008,
+ "learning_rate": 4.797265281256346e-05,
+ "loss": 35.95175048828125,
+ "step": 1100
+ },
+ {
+ "epoch": 0.23571586416873327,
+ "grad_norm": 19.709365844726562,
+ "learning_rate": 4.763419752250728e-05,
+ "loss": 35.99169921875,
+ "step": 1200
+ },
+ {
+ "epoch": 0.24985881601885726,
+ "eval_loss": 4.386976718902588,
+ "eval_runtime": 41.9212,
+ "eval_samples_per_second": 477.086,
+ "eval_steps_per_second": 14.909,
+ "step": 1272
+ },
+ {
+ "epoch": 0.255358852849461,
+ "grad_norm": 17.78667449951172,
+ "learning_rate": 4.729574223245109e-05,
+ "loss": 35.86061279296875,
+ "step": 1300
+ },
+ {
+ "epoch": 0.2750018415301888,
+ "grad_norm": 21.036474227905273,
+ "learning_rate": 4.695728694239491e-05,
+ "loss": 35.78255859375,
+ "step": 1400
+ },
+ {
+ "epoch": 0.2946448302109166,
+ "grad_norm": 19.002052307128906,
+ "learning_rate": 4.661883165233873e-05,
+ "loss": 35.5993408203125,
+ "step": 1500
+ },
+ {
+ "epoch": 0.31428781889164437,
+ "grad_norm": 19.828800201416016,
+ "learning_rate": 4.6280376362282543e-05,
+ "loss": 35.70918701171875,
+ "step": 1600
+ },
+ {
+ "epoch": 0.33393080757237215,
+ "grad_norm": 17.12042236328125,
+ "learning_rate": 4.594192107222636e-05,
+ "loss": 35.58583251953125,
+ "step": 1700
+ },
+ {
+ "epoch": 0.3535737962530999,
+ "grad_norm": 19.869712829589844,
+ "learning_rate": 4.560346578217018e-05,
+ "loss": 35.3600537109375,
+ "step": 1800
+ },
+ {
+ "epoch": 0.3732167849338277,
+ "grad_norm": 18.914470672607422,
+ "learning_rate": 4.5265010492113994e-05,
+ "loss": 35.34132080078125,
+ "step": 1900
+ },
+ {
+ "epoch": 0.39285977361455543,
+ "grad_norm": 18.888940811157227,
+ "learning_rate": 4.4926555202057814e-05,
+ "loss": 35.33590576171875,
+ "step": 2000
+ },
+ {
+ "epoch": 0.4125027622952832,
+ "grad_norm": 20.27227783203125,
+ "learning_rate": 4.4588099912001626e-05,
+ "loss": 35.22805908203125,
+ "step": 2100
+ },
+ {
+ "epoch": 0.432145750976011,
+ "grad_norm": 15.550501823425293,
+ "learning_rate": 4.424964462194544e-05,
+ "loss": 35.256953125,
+ "step": 2200
+ },
+ {
+ "epoch": 0.45178873965673877,
+ "grad_norm": 17.671451568603516,
+ "learning_rate": 4.391118933188926e-05,
+ "loss": 35.056826171875,
+ "step": 2300
+ },
+ {
+ "epoch": 0.47143172833746655,
+ "grad_norm": 18.74838638305664,
+ "learning_rate": 4.357273404183308e-05,
+ "loss": 34.9271826171875,
+ "step": 2400
+ },
+ {
+ "epoch": 0.49107471701819433,
+ "grad_norm": 20.912931442260742,
+ "learning_rate": 4.323427875177689e-05,
+ "loss": 34.793916015625,
+ "step": 2500
+ },
+ {
+ "epoch": 0.4997176320377145,
+ "eval_loss": 4.270185947418213,
+ "eval_runtime": 41.8695,
+ "eval_samples_per_second": 477.675,
+ "eval_steps_per_second": 14.927,
+ "step": 2544
+ },
+ {
+ "epoch": 0.510717705698922,
+ "grad_norm": 18.11321258544922,
+ "learning_rate": 4.289582346172071e-05,
+ "loss": 34.97314453125,
+ "step": 2600
+ },
+ {
+ "epoch": 0.5303606943796498,
+ "grad_norm": 19.38896942138672,
+ "learning_rate": 4.255736817166453e-05,
+ "loss": 34.873974609375,
+ "step": 2700
+ },
+ {
+ "epoch": 0.5500036830603776,
+ "grad_norm": 19.479278564453125,
+ "learning_rate": 4.221891288160834e-05,
+ "loss": 34.713251953125,
+ "step": 2800
+ },
+ {
+ "epoch": 0.5696466717411054,
+ "grad_norm": 19.849210739135742,
+ "learning_rate": 4.188045759155216e-05,
+ "loss": 34.66454833984375,
+ "step": 2900
+ },
+ {
+ "epoch": 0.5892896604218332,
+ "grad_norm": 19.59630012512207,
+ "learning_rate": 4.154200230149597e-05,
+ "loss": 34.546259765625,
+ "step": 3000
+ },
+ {
+ "epoch": 0.608932649102561,
+ "grad_norm": 18.32138442993164,
+ "learning_rate": 4.120354701143979e-05,
+ "loss": 34.460576171875,
+ "step": 3100
+ },
+ {
+ "epoch": 0.6285756377832887,
+ "grad_norm": 17.825464248657227,
+ "learning_rate": 4.086509172138361e-05,
+ "loss": 34.3186962890625,
+ "step": 3200
+ },
+ {
+ "epoch": 0.6482186264640165,
+ "grad_norm": 19.180105209350586,
+ "learning_rate": 4.052663643132742e-05,
+ "loss": 34.40132080078125,
+ "step": 3300
+ },
+ {
+ "epoch": 0.6678616151447443,
+ "grad_norm": 18.498641967773438,
+ "learning_rate": 4.018818114127124e-05,
+ "loss": 34.2248828125,
+ "step": 3400
+ },
+ {
+ "epoch": 0.6875046038254721,
+ "grad_norm": 19.08323097229004,
+ "learning_rate": 3.984972585121506e-05,
+ "loss": 33.97822265625,
+ "step": 3500
+ },
+ {
+ "epoch": 0.7071475925061999,
+ "grad_norm": 20.13410758972168,
+ "learning_rate": 3.9511270561158874e-05,
+ "loss": 34.1056494140625,
+ "step": 3600
+ },
+ {
+ "epoch": 0.7267905811869276,
+ "grad_norm": 18.82459259033203,
+ "learning_rate": 3.917281527110269e-05,
+ "loss": 34.2245947265625,
+ "step": 3700
+ },
+ {
+ "epoch": 0.7464335698676554,
+ "grad_norm": 17.352100372314453,
+ "learning_rate": 3.8834359981046505e-05,
+ "loss": 34.04242919921875,
+ "step": 3800
+ },
+ {
+ "epoch": 0.7495764480565718,
+ "eval_loss": 4.178581237792969,
+ "eval_runtime": 42.0642,
+ "eval_samples_per_second": 475.463,
+ "eval_steps_per_second": 14.858,
+ "step": 3816
+ },
+ {
+ "epoch": 0.7660765585483832,
+ "grad_norm": 19.103435516357422,
+ "learning_rate": 3.849590469099032e-05,
+ "loss": 34.01751708984375,
+ "step": 3900
+ },
+ {
+ "epoch": 0.7857195472291109,
+ "grad_norm": 18.045913696289062,
+ "learning_rate": 3.815744940093414e-05,
+ "loss": 33.90669921875,
+ "step": 4000
+ },
+ {
+ "epoch": 0.8053625359098386,
+ "grad_norm": 18.76168441772461,
+ "learning_rate": 3.7818994110877956e-05,
+ "loss": 33.872587890625,
+ "step": 4100
+ },
+ {
+ "epoch": 0.8250055245905664,
+ "grad_norm": 16.547574996948242,
+ "learning_rate": 3.748053882082177e-05,
+ "loss": 33.8643359375,
+ "step": 4200
+ },
+ {
+ "epoch": 0.8446485132712942,
+ "grad_norm": 18.636455535888672,
+ "learning_rate": 3.714208353076559e-05,
+ "loss": 33.7685595703125,
+ "step": 4300
+ },
+ {
+ "epoch": 0.864291501952022,
+ "grad_norm": 18.742900848388672,
+ "learning_rate": 3.68036282407094e-05,
+ "loss": 33.53510009765625,
+ "step": 4400
+ },
+ {
+ "epoch": 0.8839344906327498,
+ "grad_norm": 20.976703643798828,
+ "learning_rate": 3.646517295065322e-05,
+ "loss": 33.42730224609375,
+ "step": 4500
+ },
+ {
+ "epoch": 0.9035774793134775,
+ "grad_norm": 18.552316665649414,
+ "learning_rate": 3.612671766059704e-05,
+ "loss": 33.53127197265625,
+ "step": 4600
+ },
+ {
+ "epoch": 0.9232204679942053,
+ "grad_norm": 21.41478157043457,
+ "learning_rate": 3.578826237054085e-05,
+ "loss": 33.41383544921875,
+ "step": 4700
+ },
+ {
+ "epoch": 0.9428634566749331,
+ "grad_norm": 19.785966873168945,
+ "learning_rate": 3.544980708048467e-05,
+ "loss": 33.38103271484375,
+ "step": 4800
+ },
+ {
+ "epoch": 0.9625064453556609,
+ "grad_norm": 17.69455337524414,
+ "learning_rate": 3.511135179042849e-05,
+ "loss": 33.4733447265625,
+ "step": 4900
+ },
+ {
+ "epoch": 0.9821494340363887,
+ "grad_norm": 20.246673583984375,
+ "learning_rate": 3.47728965003723e-05,
+ "loss": 33.4089208984375,
+ "step": 5000
+ },
+ {
+ "epoch": 0.999435264075429,
+ "eval_loss": 4.0612263679504395,
+ "eval_runtime": 42.4302,
+ "eval_samples_per_second": 471.362,
+ "eval_steps_per_second": 14.73,
+ "step": 5088
+ },
+ {
+ "epoch": 1.0017678689812655,
+ "grad_norm": 19.691932678222656,
+ "learning_rate": 3.443444121031612e-05,
+ "loss": 33.1838037109375,
+ "step": 5100
+ },
+ {
+ "epoch": 1.0214108576619934,
+ "grad_norm": 18.13388442993164,
+ "learning_rate": 3.409598592025994e-05,
+ "loss": 33.21130859375,
+ "step": 5200
+ },
+ {
+ "epoch": 1.041053846342721,
+ "grad_norm": 18.41756248474121,
+ "learning_rate": 3.375753063020375e-05,
+ "loss": 33.07806396484375,
+ "step": 5300
+ },
+ {
+ "epoch": 1.060696835023449,
+ "grad_norm": 18.85491943359375,
+ "learning_rate": 3.3419075340147566e-05,
+ "loss": 32.95417724609375,
+ "step": 5400
+ },
+ {
+ "epoch": 1.0803398237041766,
+ "grad_norm": 20.03909683227539,
+ "learning_rate": 3.3080620050091385e-05,
+ "loss": 32.93142822265625,
+ "step": 5500
+ },
+ {
+ "epoch": 1.0999828123849045,
+ "grad_norm": 19.49604606628418,
+ "learning_rate": 3.27421647600352e-05,
+ "loss": 32.9296630859375,
+ "step": 5600
+ },
+ {
+ "epoch": 1.1196258010656321,
+ "grad_norm": 21.259004592895508,
+ "learning_rate": 3.2403709469979017e-05,
+ "loss": 32.85189453125,
+ "step": 5700
+ },
+ {
+ "epoch": 1.1392687897463598,
+ "grad_norm": 19.597267150878906,
+ "learning_rate": 3.206525417992283e-05,
+ "loss": 32.82923583984375,
+ "step": 5800
+ },
+ {
+ "epoch": 1.1589117784270877,
+ "grad_norm": 20.224699020385742,
+ "learning_rate": 3.172679888986665e-05,
+ "loss": 32.7767626953125,
+ "step": 5900
+ },
+ {
+ "epoch": 1.1785547671078154,
+ "grad_norm": 18.452495574951172,
+ "learning_rate": 3.138834359981047e-05,
+ "loss": 32.6592626953125,
+ "step": 6000
+ },
+ {
+ "epoch": 1.1981977557885433,
+ "grad_norm": 19.971717834472656,
+ "learning_rate": 3.104988830975428e-05,
+ "loss": 32.80323974609375,
+ "step": 6100
+ },
+ {
+ "epoch": 1.217840744469271,
+ "grad_norm": 17.584882736206055,
+ "learning_rate": 3.07114330196981e-05,
+ "loss": 32.6551416015625,
+ "step": 6200
+ },
+ {
+ "epoch": 1.2374837331499988,
+ "grad_norm": 19.49502944946289,
+ "learning_rate": 3.0372977729641915e-05,
+ "loss": 32.73489990234375,
+ "step": 6300
+ },
+ {
+ "epoch": 1.2492695263584355,
+ "eval_loss": 3.9626989364624023,
+ "eval_runtime": 41.8921,
+ "eval_samples_per_second": 477.418,
+ "eval_steps_per_second": 14.919,
+ "step": 6360
+ },
+ {
+ "epoch": 1.2571267218307265,
+ "grad_norm": 20.214130401611328,
+ "learning_rate": 3.0034522439585734e-05,
+ "loss": 32.55817138671875,
+ "step": 6400
+ },
+ {
+ "epoch": 1.2767697105114544,
+ "grad_norm": 20.668527603149414,
+ "learning_rate": 2.969606714952955e-05,
+ "loss": 32.53630615234375,
+ "step": 6500
+ },
+ {
+ "epoch": 1.296412699192182,
+ "grad_norm": 18.479408264160156,
+ "learning_rate": 2.9357611859473366e-05,
+ "loss": 32.464462890625,
+ "step": 6600
+ },
+ {
+ "epoch": 1.31605568787291,
+ "grad_norm": 19.027793884277344,
+ "learning_rate": 2.9019156569417182e-05,
+ "loss": 32.48625244140625,
+ "step": 6700
+ },
+ {
+ "epoch": 1.3356986765536376,
+ "grad_norm": 19.871105194091797,
+ "learning_rate": 2.8680701279361e-05,
+ "loss": 32.41595947265625,
+ "step": 6800
+ },
+ {
+ "epoch": 1.3553416652343655,
+ "grad_norm": 19.916994094848633,
+ "learning_rate": 2.8342245989304817e-05,
+ "loss": 32.11419921875,
+ "step": 6900
+ },
+ {
+ "epoch": 1.3749846539150932,
+ "grad_norm": 21.212909698486328,
+ "learning_rate": 2.8003790699248633e-05,
+ "loss": 32.30314453125,
+ "step": 7000
+ },
+ {
+ "epoch": 1.3946276425958208,
+ "grad_norm": 25.216768264770508,
+ "learning_rate": 2.7665335409192445e-05,
+ "loss": 32.19344482421875,
+ "step": 7100
+ },
+ {
+ "epoch": 1.4142706312765487,
+ "grad_norm": 19.619844436645508,
+ "learning_rate": 2.732688011913626e-05,
+ "loss": 32.30953125,
+ "step": 7200
+ },
+ {
+ "epoch": 1.4339136199572766,
+ "grad_norm": 21.061376571655273,
+ "learning_rate": 2.6988424829080077e-05,
+ "loss": 32.3416162109375,
+ "step": 7300
+ },
+ {
+ "epoch": 1.4535566086380043,
+ "grad_norm": 18.674562454223633,
+ "learning_rate": 2.6649969539023896e-05,
+ "loss": 32.2662744140625,
+ "step": 7400
+ },
+ {
+ "epoch": 1.473199597318732,
+ "grad_norm": 18.776655197143555,
+ "learning_rate": 2.6311514248967712e-05,
+ "loss": 32.07302978515625,
+ "step": 7500
+ },
+ {
+ "epoch": 1.4928425859994598,
+ "grad_norm": 19.0480899810791,
+ "learning_rate": 2.5973058958911528e-05,
+ "loss": 32.19434326171875,
+ "step": 7600
+ },
+ {
+ "epoch": 1.4991283423772928,
+ "eval_loss": 3.9200026988983154,
+ "eval_runtime": 42.1732,
+ "eval_samples_per_second": 474.235,
+ "eval_steps_per_second": 14.82,
+ "step": 7632
+ },
+ {
+ "epoch": 1.5124855746801877,
+ "grad_norm": 18.192241668701172,
+ "learning_rate": 2.5634603668855344e-05,
+ "loss": 32.0382470703125,
+ "step": 7700
+ },
+ {
+ "epoch": 1.5321285633609154,
+ "grad_norm": 21.64850425720215,
+ "learning_rate": 2.5296148378799163e-05,
+ "loss": 32.0071484375,
+ "step": 7800
+ },
+ {
+ "epoch": 1.551771552041643,
+ "grad_norm": 21.07256507873535,
+ "learning_rate": 2.495769308874298e-05,
+ "loss": 32.02736328125,
+ "step": 7900
+ },
+ {
+ "epoch": 1.5714145407223707,
+ "grad_norm": 18.811485290527344,
+ "learning_rate": 2.4619237798686794e-05,
+ "loss": 32.18232666015625,
+ "step": 8000
+ },
+ {
+ "epoch": 1.5910575294030986,
+ "grad_norm": 20.226411819458008,
+ "learning_rate": 2.428078250863061e-05,
+ "loss": 31.81751220703125,
+ "step": 8100
+ },
+ {
+ "epoch": 1.6107005180838265,
+ "grad_norm": 21.44918441772461,
+ "learning_rate": 2.394232721857443e-05,
+ "loss": 31.89043212890625,
+ "step": 8200
+ },
+ {
+ "epoch": 1.6303435067645542,
+ "grad_norm": 19.660367965698242,
+ "learning_rate": 2.3603871928518245e-05,
+ "loss": 31.95683349609375,
+ "step": 8300
+ },
+ {
+ "epoch": 1.6499864954452819,
+ "grad_norm": 19.144596099853516,
+ "learning_rate": 2.3265416638462058e-05,
+ "loss": 31.867197265625,
+ "step": 8400
+ },
+ {
+ "epoch": 1.6696294841260098,
+ "grad_norm": 18.604026794433594,
+ "learning_rate": 2.2926961348405877e-05,
+ "loss": 31.78265380859375,
+ "step": 8500
+ },
+ {
+ "epoch": 1.6892724728067376,
+ "grad_norm": 19.978652954101562,
+ "learning_rate": 2.2588506058349693e-05,
+ "loss": 31.79925048828125,
+ "step": 8600
+ },
+ {
+ "epoch": 1.7089154614874653,
+ "grad_norm": 18.18141746520996,
+ "learning_rate": 2.225005076829351e-05,
+ "loss": 31.853974609375,
+ "step": 8700
+ },
+ {
+ "epoch": 1.728558450168193,
+ "grad_norm": 17.99820899963379,
+ "learning_rate": 2.1911595478237325e-05,
+ "loss": 31.75265625,
+ "step": 8800
+ },
+ {
+ "epoch": 1.7482014388489209,
+ "grad_norm": 20.680606842041016,
+ "learning_rate": 2.1573140188181144e-05,
+ "loss": 31.6561328125,
+ "step": 8900
+ },
+ {
+ "epoch": 1.74898715839615,
+ "eval_loss": 3.863434076309204,
+ "eval_runtime": 42.0055,
+ "eval_samples_per_second": 476.128,
+ "eval_steps_per_second": 14.879,
+ "step": 8904
+ },
+ {
+ "epoch": 1.7678444275296488,
+ "grad_norm": 20.50802993774414,
+ "learning_rate": 2.123468489812496e-05,
+ "loss": 31.6072265625,
+ "step": 9000
+ },
+ {
+ "epoch": 1.7874874162103764,
+ "grad_norm": 21.482328414916992,
+ "learning_rate": 2.0896229608068775e-05,
+ "loss": 31.7250537109375,
+ "step": 9100
+ },
+ {
+ "epoch": 1.807130404891104,
+ "grad_norm": 19.20509910583496,
+ "learning_rate": 2.055777431801259e-05,
+ "loss": 31.58796875,
+ "step": 9200
+ },
+ {
+ "epoch": 1.826773393571832,
+ "grad_norm": 21.03694725036621,
+ "learning_rate": 2.0219319027956407e-05,
+ "loss": 31.68398193359375,
+ "step": 9300
+ },
+ {
+ "epoch": 1.8464163822525599,
+ "grad_norm": 18.272459030151367,
+ "learning_rate": 1.9880863737900223e-05,
+ "loss": 31.53086181640625,
+ "step": 9400
+ },
+ {
+ "epoch": 1.8660593709332876,
+ "grad_norm": 19.046916961669922,
+ "learning_rate": 1.9542408447844042e-05,
+ "loss": 31.525322265625,
+ "step": 9500
+ },
+ {
+ "epoch": 1.8857023596140152,
+ "grad_norm": 21.118305206298828,
+ "learning_rate": 1.9203953157787858e-05,
+ "loss": 31.52841552734375,
+ "step": 9600
+ },
+ {
+ "epoch": 1.905345348294743,
+ "grad_norm": 18.861080169677734,
+ "learning_rate": 1.8865497867731674e-05,
+ "loss": 31.4529345703125,
+ "step": 9700
+ },
+ {
+ "epoch": 1.9249883369754708,
+ "grad_norm": 20.4729061126709,
+ "learning_rate": 1.852704257767549e-05,
+ "loss": 31.35305419921875,
+ "step": 9800
+ },
+ {
+ "epoch": 1.9446313256561987,
+ "grad_norm": 17.702392578125,
+ "learning_rate": 1.818858728761931e-05,
+ "loss": 31.5552734375,
+ "step": 9900
+ },
+ {
+ "epoch": 1.9642743143369263,
+ "grad_norm": 21.927942276000977,
+ "learning_rate": 1.785013199756312e-05,
+ "loss": 31.42951416015625,
+ "step": 10000
+ },
+ {
+ "epoch": 1.983917303017654,
+ "grad_norm": 19.895252227783203,
+ "learning_rate": 1.7511676707506937e-05,
+ "loss": 31.416962890625,
+ "step": 10100
+ },
+ {
+ "epoch": 1.998845974415007,
+ "eval_loss": 3.819389820098877,
+ "eval_runtime": 41.9176,
+ "eval_samples_per_second": 477.126,
+ "eval_steps_per_second": 14.91,
+ "step": 10176
+ },
+ {
+ "epoch": 2.003535737962531,
+ "grad_norm": 20.209577560424805,
+ "learning_rate": 1.7173221417450756e-05,
+ "loss": 31.25757568359375,
+ "step": 10200
+ },
+ {
+ "epoch": 2.0231787266432586,
+ "grad_norm": 19.49869155883789,
+ "learning_rate": 1.6834766127394572e-05,
+ "loss": 31.14395263671875,
+ "step": 10300
+ },
+ {
+ "epoch": 2.0428217153239867,
+ "grad_norm": 20.60426139831543,
+ "learning_rate": 1.6496310837338388e-05,
+ "loss": 31.38017333984375,
+ "step": 10400
+ },
+ {
+ "epoch": 2.0624647040047144,
+ "grad_norm": 19.177818298339844,
+ "learning_rate": 1.6157855547282204e-05,
+ "loss": 31.20944091796875,
+ "step": 10500
+ },
+ {
+ "epoch": 2.082107692685442,
+ "grad_norm": 20.949337005615234,
+ "learning_rate": 1.5819400257226023e-05,
+ "loss": 31.2146240234375,
+ "step": 10600
+ },
+ {
+ "epoch": 2.1017506813661697,
+ "grad_norm": 19.25591468811035,
+ "learning_rate": 1.548094496716984e-05,
+ "loss": 31.202607421875,
+ "step": 10700
+ },
+ {
+ "epoch": 2.121393670046898,
+ "grad_norm": 18.960092544555664,
+ "learning_rate": 1.5142489677113653e-05,
+ "loss": 31.14611572265625,
+ "step": 10800
+ },
+ {
+ "epoch": 2.1410366587276255,
+ "grad_norm": 18.479068756103516,
+ "learning_rate": 1.4804034387057469e-05,
+ "loss": 31.26153564453125,
+ "step": 10900
+ },
+ {
+ "epoch": 2.160679647408353,
+ "grad_norm": 21.587387084960938,
+ "learning_rate": 1.4465579097001287e-05,
+ "loss": 31.1222998046875,
+ "step": 11000
+ },
+ {
+ "epoch": 2.180322636089081,
+ "grad_norm": 17.947052001953125,
+ "learning_rate": 1.4127123806945102e-05,
+ "loss": 31.08917236328125,
+ "step": 11100
+ },
+ {
+ "epoch": 2.199965624769809,
+ "grad_norm": 19.169307708740234,
+ "learning_rate": 1.378866851688892e-05,
+ "loss": 31.0661474609375,
+ "step": 11200
+ },
+ {
+ "epoch": 2.2196086134505366,
+ "grad_norm": 16.882522583007812,
+ "learning_rate": 1.3450213226832736e-05,
+ "loss": 30.9886328125,
+ "step": 11300
+ },
+ {
+ "epoch": 2.2392516021312643,
+ "grad_norm": 19.624177932739258,
+ "learning_rate": 1.3111757936776553e-05,
+ "loss": 30.93468505859375,
+ "step": 11400
+ },
+ {
+ "epoch": 2.2486802366980134,
+ "eval_loss": 3.786958694458008,
+ "eval_runtime": 41.8524,
+ "eval_samples_per_second": 477.869,
+ "eval_steps_per_second": 14.933,
+ "step": 11448
+ },
+ {
+ "epoch": 2.258894590811992,
+ "grad_norm": 20.477542877197266,
+ "learning_rate": 1.2773302646720369e-05,
+ "loss": 31.00721923828125,
+ "step": 11500
+ },
+ {
+ "epoch": 2.2785375794927196,
+ "grad_norm": 19.928098678588867,
+ "learning_rate": 1.2434847356664185e-05,
+ "loss": 30.95269775390625,
+ "step": 11600
+ },
+ {
+ "epoch": 2.2981805681734477,
+ "grad_norm": 19.002788543701172,
+ "learning_rate": 1.2096392066608003e-05,
+ "loss": 30.908701171875,
+ "step": 11700
+ },
+ {
+ "epoch": 2.3178235568541754,
+ "grad_norm": 20.50242805480957,
+ "learning_rate": 1.1757936776551818e-05,
+ "loss": 30.96546142578125,
908
+ "step": 11800
909
+ },
910
+ {
911
+ "epoch": 2.337466545534903,
912
+ "grad_norm": 20.48063850402832,
913
+ "learning_rate": 1.1419481486495634e-05,
914
+ "loss": 30.9563623046875,
915
+ "step": 11900
916
+ },
917
+ {
918
+ "epoch": 2.3571095342156307,
919
+ "grad_norm": 19.522266387939453,
920
+ "learning_rate": 1.108102619643945e-05,
921
+ "loss": 30.85960205078125,
922
+ "step": 12000
923
+ },
924
+ {
925
+ "epoch": 2.376752522896359,
926
+ "grad_norm": 21.33004379272461,
927
+ "learning_rate": 1.0742570906383268e-05,
928
+ "loss": 30.829111328125,
929
+ "step": 12100
930
+ },
931
+ {
932
+ "epoch": 2.3963955115770865,
933
+ "grad_norm": 20.311534881591797,
934
+ "learning_rate": 1.0404115616327083e-05,
935
+ "loss": 30.93859619140625,
936
+ "step": 12200
937
+ },
938
+ {
939
+ "epoch": 2.416038500257814,
940
+ "grad_norm": 20.128795623779297,
941
+ "learning_rate": 1.00656603262709e-05,
942
+ "loss": 30.750810546875,
943
+ "step": 12300
944
+ },
945
+ {
946
+ "epoch": 2.435681488938542,
947
+ "grad_norm": 22.28921890258789,
948
+ "learning_rate": 9.727205036214717e-06,
949
+ "loss": 30.714560546875,
950
+ "step": 12400
951
+ },
952
+ {
953
+ "epoch": 2.4553244776192695,
954
+ "grad_norm": 24.13454818725586,
955
+ "learning_rate": 9.388749746158533e-06,
956
+ "loss": 30.87623779296875,
957
+ "step": 12500
958
+ },
959
+ {
960
+ "epoch": 2.4749674662999976,
961
+ "grad_norm": 20.58381462097168,
962
+ "learning_rate": 9.05029445610235e-06,
963
+ "loss": 30.60492431640625,
964
+ "step": 12600
965
+ },
966
+ {
967
+ "epoch": 2.4946104549807253,
968
+ "grad_norm": 20.045475006103516,
969
+ "learning_rate": 8.711839166046164e-06,
970
+ "loss": 30.8008154296875,
971
+ "step": 12700
972
+ },
973
+ {
974
+ "epoch": 2.498539052716871,
975
+ "eval_loss": 3.7586019039154053,
976
+ "eval_runtime": 42.0034,
977
+ "eval_samples_per_second": 476.152,
978
+ "eval_steps_per_second": 14.88,
979
+ "step": 12720
980
+ },
981
+ {
982
+ "epoch": 2.514253443661453,
983
+ "grad_norm": 19.53034210205078,
984
+ "learning_rate": 8.373383875989982e-06,
985
+ "loss": 30.7534326171875,
986
+ "step": 12800
987
+ },
988
+ {
989
+ "epoch": 2.533896432342181,
990
+ "grad_norm": 20.510520935058594,
991
+ "learning_rate": 8.0349285859338e-06,
992
+ "loss": 30.7755859375,
993
+ "step": 12900
994
+ },
995
+ {
996
+ "epoch": 2.5535394210229088,
997
+ "grad_norm": 20.725147247314453,
998
+ "learning_rate": 7.696473295877615e-06,
999
+ "loss": 30.9005029296875,
1000
+ "step": 13000
1001
+ },
1002
+ {
1003
+ "epoch": 2.5731824097036364,
1004
+ "grad_norm": 20.11240577697754,
1005
+ "learning_rate": 7.358018005821431e-06,
1006
+ "loss": 30.81609130859375,
1007
+ "step": 13100
1008
+ },
1009
+ {
1010
+ "epoch": 2.592825398384364,
1011
+ "grad_norm": 19.01041603088379,
1012
+ "learning_rate": 7.019562715765248e-06,
1013
+ "loss": 30.70421142578125,
1014
+ "step": 13200
1015
+ },
1016
+ {
1017
+ "epoch": 2.6124683870650918,
1018
+ "grad_norm": 20.232532501220703,
1019
+ "learning_rate": 6.681107425709064e-06,
1020
+ "loss": 30.7105322265625,
1021
+ "step": 13300
1022
+ },
1023
+ {
1024
+ "epoch": 2.63211137574582,
1025
+ "grad_norm": 21.33913803100586,
1026
+ "learning_rate": 6.342652135652881e-06,
1027
+ "loss": 30.76764892578125,
1028
+ "step": 13400
1029
+ },
1030
+ {
1031
+ "epoch": 2.6517543644265475,
1032
+ "grad_norm": 19.718833923339844,
1033
+ "learning_rate": 6.004196845596697e-06,
1034
+ "loss": 30.75705078125,
1035
+ "step": 13500
1036
+ },
1037
+ {
1038
+ "epoch": 2.671397353107275,
1039
+ "grad_norm": 20.983705520629883,
1040
+ "learning_rate": 5.665741555540514e-06,
1041
+ "loss": 30.66492431640625,
1042
+ "step": 13600
1043
+ },
1044
+ {
1045
+ "epoch": 2.691040341788003,
1046
+ "grad_norm": 18.726970672607422,
1047
+ "learning_rate": 5.3272862654843295e-06,
1048
+ "loss": 30.72262451171875,
1049
+ "step": 13700
1050
+ },
1051
+ {
1052
+ "epoch": 2.710683330468731,
1053
+ "grad_norm": 21.197751998901367,
1054
+ "learning_rate": 4.988830975428146e-06,
1055
+ "loss": 30.6771728515625,
1056
+ "step": 13800
1057
+ },
1058
+ {
1059
+ "epoch": 2.7303263191494587,
1060
+ "grad_norm": 21.318998336791992,
1061
+ "learning_rate": 4.650375685371963e-06,
1062
+ "loss": 30.52608642578125,
1063
+ "step": 13900
1064
+ },
1065
+ {
1066
+ "epoch": 2.748397868735728,
1067
+ "eval_loss": 3.7217817306518555,
1068
+ "eval_runtime": 41.8977,
1069
+ "eval_samples_per_second": 477.353,
1070
+ "eval_steps_per_second": 14.917,
1071
+ "step": 13992
1072
+ },
1073
+ {
1074
+ "epoch": 2.7499693078301863,
1075
+ "grad_norm": 19.221782684326172,
1076
+ "learning_rate": 4.3119203953157795e-06,
1077
+ "loss": 30.623740234375,
1078
+ "step": 14000
1079
+ },
1080
+ {
1081
+ "epoch": 2.769612296510914,
1082
+ "grad_norm": 17.869342803955078,
1083
+ "learning_rate": 3.973465105259595e-06,
1084
+ "loss": 30.61525634765625,
1085
+ "step": 14100
1086
+ },
1087
+ {
1088
+ "epoch": 2.7892552851916417,
1089
+ "grad_norm": 18.45319938659668,
1090
+ "learning_rate": 3.635009815203412e-06,
1091
+ "loss": 30.6033154296875,
1092
+ "step": 14200
1093
+ },
1094
+ {
1095
+ "epoch": 2.80889827387237,
1096
+ "grad_norm": 18.811376571655273,
1097
+ "learning_rate": 3.296554525147228e-06,
1098
+ "loss": 30.6585498046875,
1099
+ "step": 14300
1100
+ },
1101
+ {
1102
+ "epoch": 2.8285412625530975,
1103
+ "grad_norm": 19.703887939453125,
1104
+ "learning_rate": 2.9580992350910446e-06,
1105
+ "loss": 30.57553955078125,
1106
+ "step": 14400
1107
+ },
1108
+ {
1109
+ "epoch": 2.848184251233825,
1110
+ "grad_norm": 19.73227882385254,
1111
+ "learning_rate": 2.619643945034861e-06,
1112
+ "loss": 30.6737548828125,
1113
+ "step": 14500
1114
+ },
1115
+ {
1116
+ "epoch": 2.8678272399145532,
1117
+ "grad_norm": 20.356998443603516,
1118
+ "learning_rate": 2.281188654978677e-06,
1119
+ "loss": 30.6970361328125,
1120
+ "step": 14600
1121
+ },
1122
+ {
1123
+ "epoch": 2.887470228595281,
1124
+ "grad_norm": 21.043643951416016,
1125
+ "learning_rate": 1.942733364922494e-06,
1126
+ "loss": 30.5473388671875,
1127
+ "step": 14700
1128
+ },
1129
+ {
1130
+ "epoch": 2.9071132172760086,
1131
+ "grad_norm": 18.5972843170166,
1132
+ "learning_rate": 1.6042780748663105e-06,
1133
+ "loss": 30.6621875,
1134
+ "step": 14800
1135
+ },
1136
+ {
1137
+ "epoch": 2.9267562059567362,
1138
+ "grad_norm": 19.071395874023438,
1139
+ "learning_rate": 1.2658227848101265e-06,
1140
+ "loss": 30.491064453125,
1141
+ "step": 14900
1142
+ },
1143
+ {
1144
+ "epoch": 2.946399194637464,
1145
+ "grad_norm": 19.5911808013916,
1146
+ "learning_rate": 9.27367494753943e-07,
1147
+ "loss": 30.54241455078125,
1148
+ "step": 15000
1149
+ },
1150
+ {
1151
+ "epoch": 2.966042183318192,
1152
+ "grad_norm": 21.04094696044922,
1153
+ "learning_rate": 5.889122046977595e-07,
1154
+ "loss": 30.4769384765625,
1155
+ "step": 15100
1156
+ },
1157
+ {
1158
+ "epoch": 2.9856851719989197,
1159
+ "grad_norm": 18.463882446289062,
1160
+ "learning_rate": 2.5045691464157585e-07,
1161
+ "loss": 30.60388916015625,
1162
+ "step": 15200
1163
+ },
1164
+ {
1165
+ "epoch": 2.9982566847545855,
1166
+ "eval_loss": 3.722125291824341,
1167
+ "eval_runtime": 42.0954,
1168
+ "eval_samples_per_second": 475.111,
1169
+ "eval_steps_per_second": 14.847,
1170
+ "step": 15264
1171
+ }
1172
+ ],
1173
+ "logging_steps": 100,
1174
+ "max_steps": 15273,
1175
+ "num_input_tokens_seen": 0,
1176
+ "num_train_epochs": 3,
1177
+ "save_steps": 1272,
1178
+ "stateful_callbacks": {
1179
+ "TrainerControl": {
1180
+ "args": {
1181
+ "should_epoch_stop": false,
1182
+ "should_evaluate": false,
1183
+ "should_log": false,
1184
+ "should_save": true,
1185
+ "should_training_stop": true
1186
+ },
1187
+ "attributes": {}
1188
+ }
1189
+ },
1190
+ "total_flos": 1.0290819911189391e+18,
1191
+ "train_batch_size": 32,
1192
+ "trial_name": null,
1193
+ "trial_params": null
1194
+ }
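The `eval_loss` values logged above are raw cross-entropy; for a masked language model, perplexity is simply their exponential. A minimal sketch, with the dict literal copied from the final eval record above (step 15264):

```python
import math

# Final evaluation entry copied from the trainer state above (step 15264).
final_eval = {"epoch": 2.9982566847545855, "eval_loss": 3.722125291824341, "step": 15264}

# For a masked LM, perplexity = exp(cross-entropy eval loss).
perplexity = math.exp(final_eval["eval_loss"])
print(f"step {final_eval['step']}: perplexity = {perplexity:.2f}")  # ~41.35
```

The same conversion applies to any of the intermediate `eval_loss` entries, so the full evaluation curve in `log_history` can be read off in perplexity terms.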
checkpoint-15273/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eaec524752280134c2e0387d7f5f1e2cc6d34eaa3f289327a642e9b1d7d2b9c9
+ size 5137
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e5b34a93e4edb5d17560bd29970e86848d3ed25c9ed758b749e1f9bcbfa93606
- size 442633884
+ oid sha256:792f46cb1378b8ab1a168296ccf3cff6636948e4023ba6f87849d3969770c012
+ size 442633860
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3de483099e0f14e67b25caaa2bbb1cb1097bf08c651d7169f2211a8fd2657c92
+ oid sha256:eaec524752280134c2e0387d7f5f1e2cc6d34eaa3f289327a642e9b1d7d2b9c9
  size 5137