hellosindh committed
Commit e0f6e92 · verified · 1 Parent(s): e99b148

Upload folder using huggingface_hub

README.md CHANGED
@@ -7,272 +7,36 @@ tags:
7
  - bert
8
  - masked-language-modeling
9
  - from-scratch
10
- - nlp
11
- model-index:
12
- - name: sindhi-bert-base
13
- results:
14
- - task:
15
- type: fill-mask
16
- name: Masked Language Modeling
17
- metrics:
18
- - type: perplexity
19
- value: 28.46
20
- name: Perplexity (Session 3)
21
  ---
22
 
23
  # Sindhi-BERT-base
24
 
25
- The first BERT-style language model trained **from scratch** on Sindhi text, using a custom Sindhi BPE tokenizer with 32,000 pure Sindhi tokens.
26
-
27
- ---
28
 
29
  ## Training History
30
 
31
- | Session | Data | Epochs | Perplexity | Fill-Mask Quality | Time |
32
- |---|---|---|---|---|---|
33
- | Session 1 | 500K lines | 5 | 78.10 | 50% (5/10) | 301 min |
34
- | Session 2 | 1.5M lines | 3 | 41.62 | 70% (7/10) | 359 min |
35
- | **Session 3** | **1.49M lines (589MB clean)** | **2** | **28.46** | **80% (8/10)** | **224 min** |
36
-
37
- ---
38
-
39
- ## Model Details
40
-
41
- | Detail | Value |
42
- |---|---|
43
- | Architecture | RoBERTa-base |
44
- | Vocabulary | 32,000 tokens (pure Sindhi BPE) |
45
- | Hidden size | 768 |
46
- | Layers | 12 |
47
- | Attention heads | 12 |
48
- | Max length | 512 tokens |
49
- | Parameters | ~110M |
50
- | Language | Sindhi (sd) |
51
- | License | MIT |
52
-
53
- ---
54
-
55
- ## Session 3 Training Details
56
-
57
- | Detail | Value |
58
- |---|---|
59
- | Corpus size | 589 MB clean Sindhi text |
60
- | Total words | ~74 million |
61
- | Epochs | 2 |
62
- | Batch size | 64 (effective 256) |
63
- | Learning rate | 3e-5 |
64
- | LR scheduler | Cosine decay |
65
- | Warmup | 5% of total steps |
66
- | Precision | bf16 (A100) |
67
- | Gradient clipping | 1.5 |
68
- | Hardware | H100 GPU |
69
- | Training time | 224 minutes |
70
- | Eval loss | 3.348446 |
71
- | Perplexity | 28.46 |
72
-
73
- ---
74
-
75
- ## Fill-Mask Results — Session 3
76
-
77
- ### ✅ Correct Predictions (8/10)
78
-
79
- **1. Language identification**
80
- ```
81
- Input : سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي
82
- ✅ Top 1 : ٻولي (language) — 40.90%
83
- Top 2 : ادب (literature) — 7.86%
84
- Top 3 : ٻوليءَ — 7.20%
85
- ```
86
-
87
- **2. People context**
88
- ```
89
- Input : پاڪستان ۾ سنڌي [MASK] گھڻي تعداد ۾ رھن ٿا
90
- ✅ Top 1 : ماڻهو (people) — 33.47%
91
- Top 2 : سنڌي — 2.65%
92
- Top 3 : ٻار (children) — 2.63%
93
- ```
94
-
95
- **3. City identification**
96
- ```
97
- Input : ڪراچي سنڌ جو سڀ کان وڏو [MASK] آھي
98
- ✅ Top 1 : شھر (city) — 16.72%
99
- Top 2 : حصو (part) — 7.02%
100
- Top 3 : ملڪ (country) — 4.06%
101
- ```
102
-
103
- **4. Direction context**
104
- ```
105
- Input : ھو پنھنجي [MASK] ڏانھن ويو
106
- ✅ Top 1 : گهر (home) — 11.67%
107
- Top 2 : ڳوٺ (village) — 6.63%
108
- Top 3 : منزل (destination) — 5.15%
109
- ```
110
-
111
- **5. Poet identification**
112
- ```
113
- Input : شاھه لطيف سنڌي [MASK] جو وڏو شاعر آھي
114
- ✅ Top 1 : شاعريءَ (poetry) — 25.77%
115
- Top 2 : ٻوليءَ (language) — 25.76%
116
- Top 3 : ادب (literature) — 13.00%
117
- ```
118
-
119
- **6. History context**
120
- ```
121
- Input : سنڌ جي [MASK] ڏاڍي پراڻي آھي
122
- ✅ Top 1 : تاريخ (history) — 16.04%
123
- Top 2 : ٻولي (language) — 3.88%
124
- Top 3 : ڌرتي (land) — 3.67%
125
- ```
126
-
127
- **7. Grammar word**
128
- ```
129
- Input : دنيا [MASK] گھڻي مصروف آھي
130
- ✅ Top 1 : ۾ (in) — 23.20%
131
- Top 2 : کي (to) — 17.54%
132
- Top 3 : جي (of) — 3.71%
133
- ```
134
-
135
- **8. Education context (close)**
136
- ```
137
- Input : استاد شاگردن کي [MASK] سيکاري ٿو
138
- ⚠️ Top 1 : استاد (teacher — repeats subject) — 15.87%
139
- ✅ Top 2 : تعليم (education) — 13.70%
140
- Top 3 : سبق (lesson) — 6.03%
141
- ```
142
-
143
- ---
144
-
145
- ### ❌ Incorrect Predictions (2/10)
146
-
147
- **9. School context (wrong)**
148
- ```
149
- Input : ٻار [MASK] ۾ پڙھن ٿا
150
- ❌ Top 1 : گهر (home) — 2.46% ← should be اسڪول (school)
151
- Top 2 : َ — 2.33% ← diacritic noise
152
- Top 3 : اکين (eyes) — 2.26%
153
- Expected : اسڪول (school) ← model needs more school context data
154
- ```
155
-
156
- **10. River context (close)**
157
- ```
158
- Input : سنڌو [MASK] سنڌ جي سڀيتا جو مرڪز رھيو آھي
159
- ⚠️ Top 1 : سڀيتا (civilization) — 15.54% ← repeats next word
160
- ✅ Top 2 : ندي (river) — 7.19% ← correct answer
161
- Top 3 : ۽ (and) — 5.82%
162
- Expected : ندي (river) ← correct but at Top 2
163
- ```
164
-
165
- ---
166
-
167
- ## Progress Across Sessions
168
-
169
- | Sentence | Session 1 | Session 2 | Session 3 |
170
- |---|---|---|---|
171
- | سنڌي ___ دنيا جي | ✅ ٻولي 15% | ✅ ٻولي 22% | ✅ ٻولي **40.90%** |
172
- | پاڪستان ۾ سنڌي ___ | ❌ | ✅ ماڻهو 49% | ✅ ماڻهو **33.47%** |
173
- | ڪراچي سنڌ جو ___ | ✅ Top 3 | ✅ شھر 9% | ✅ شھر **16.72%** |
174
- | ھو پنھنجي ___ ڏانھن | ⚠️ | ⚠️ | ✅ گهر **11.67%** |
175
- | شاھه لطيف ___ | ✅ | ✅ | ✅ شاعريءَ **25.77%** |
176
- | سنڌ جي ___ پراڻي | ✅ Top 2 | ✅ Top 1 | ✅ تاريخ **16.04%** |
177
- | استاد ___ سيکاري | ✅ تعليم | ❌ استاد | ⚠️ Top 2 تعليم |
178
- | ٻار ___ ۾ پڙھن | ❌ | ❌ | ❌ گهر |
179
- | دنيا ___ مصروف | ✅ ۾ | ✅ ۾ 38% | ✅ ۾ **23.20%** |
180
- | سنڌو ___ سنڌ جي | ❌ | ⚠️ Top 4 | ⚠️ Top 2 ندي |
181
- | **Score** | **50%** | **70%** | **80%** |
182
-
183
- ---
184
-
185
- ## Tokenizer
186
-
187
- Custom Sindhi BPE tokenizer — every Sindhi word stays as ONE token:
188
-
189
- ```python
190
- Input : سنڌي ٻولي دنيا جي قديم ٻولين مان ھڪ آھي
191
- Tokens : ['▁سنڌي', '▁ٻولي', '▁دنيا', '▁جي', '▁قديم', '▁ٻولين', '▁مان', '▁ھڪ', '▁آھي']
192
- Count : 9 words = 9 tokens ✅
193
- ```
194
-
195
- Unlike mBERT or XLM-R, which split Sindhi words into multiple subword pieces, our tokenizer keeps each Sindhi word as a single token.
196
-
197
- ---
198
-
199
- ## Comparison With Other Models
200
-
201
- | Model | Type | Perplexity | Fill-mask Quality |
202
- |---|---|---|---|
203
- | mBERT fine-tuned | Multilingual | 4.19 | ❌ Predicts punctuation |
204
- | XLM-R fine-tuned | Multilingual | 5.88 | ✅ 80% correct |
205
- | SindhiBERT Session 1 | Sindhi only | 78.10 | ✅ 50% |
206
- | SindhiBERT Session 2 | Sindhi only | 41.62 | ✅ 70% |
207
- | **SindhiBERT Session 3** | **Sindhi only** | **28.46** | **✅ 80%** |
208
-
209
- > Note: mBERT/XLM-R perplexity is low because they start from pretrained multilingual weights. SindhiBERT starts from zero and learns pure Sindhi — its predictions are always real Sindhi words, never punctuation or non-Sindhi tokens.
210
-
211
- ---
212
 
213
  ## Usage
214
 
215
  ```python
216
- from transformers import AutoModelForMaskedLM
217
- import sentencepiece as spm
218
- import torch
219
  import torch.nn.functional as F
220
  from huggingface_hub import hf_hub_download
221
 
222
- # Load model
223
- model = AutoModelForMaskedLM.from_pretrained('hellosindh/sindhi-bert-base')
224
- model.eval()
225
-
226
- # Load tokenizer
227
- sp_path = hf_hub_download('hellosindh/sindhi-bert-base', 'sindhi_bpe_32k.model')
228
- sp = spm.SentencePieceProcessor()
229
- sp.Load(sp_path)
230
-
231
- # Constants
232
  MASK_ID = 32000
233
  BOS_ID = 2
234
  EOS_ID = 3
235
- VOCAB_SIZE = 32000
236
-
237
- def fill_mask(sentence, top_k=5):
238
- parts = sentence.split('[MASK]')
239
- left_ids = sp.EncodeAsIds(parts[0].strip())
240
- right_ids = sp.EncodeAsIds(parts[1].strip())
241
- input_ids = [BOS_ID] + left_ids + [MASK_ID] + right_ids + [EOS_ID]
242
- mask_pos = len(left_ids) + 1
243
- tensor = torch.tensor([input_ids])
244
- with torch.no_grad():
245
- logits = model(tensor).logits[0, mask_pos]
246
- logits[MASK_ID] = -float('inf')
247
- probs = F.softmax(logits[:VOCAB_SIZE], dim=-1)
248
- top_probs, top_ids = torch.topk(probs, top_k)
249
- for prob, idx in zip(top_probs, top_ids):
250
- word = sp.IdToPiece(idx.item()).replace('▁', '')
251
- print(f'{word:<20} {prob.item()*100:.2f}%')
252
 
253
- # Example
254
- fill_mask('سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي')
255
- # ٻولي 40.90%
256
- # ادب 7.86%
257
  ```
258
-
259
- ---
260
-
261
- ## Roadmap
262
-
263
- - [x] Custom Sindhi BPE tokenizer (32K vocab)
264
- - [x] Session 1 — 500K lines, 5 epochs, PPL 78.10
265
- - [x] Session 2 — 1.5M lines, 3 epochs, PPL 41.62
266
- - [x] Session 3 — 589MB clean corpus, 2 epochs, PPL 28.46
267
- - [ ] Session 4 — more data + 3 epochs → target PPL ~18
268
- - [ ] Session 5 — fine-tune lower LR → target PPL ~12
269
- - [ ] Spell checker fine-tuning
270
- - [ ] Next word prediction
271
- - [ ] Named entity recognition
272
- - [ ] Sindhi chatbot
273
-
274
- ---
275
-
276
- ## About
277
-
278
- The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
 
7
  - bert
8
  - masked-language-modeling
9
  - from-scratch
10
  ---
11
 
12
  # Sindhi-BERT-base
13
 
14
+ First BERT-style model trained from scratch on Sindhi text.
 
 
15
 
16
  ## Training History
17
 
18
+ | Session | Data | Epochs | PPL | Notes |
19
+ |---|---|---|---|---|
20
+ | S1 | 500K lines | 5 | 78.10 | from scratch |
21
+ | S2 | 1.5M lines | 3 | 41.62 | continued |
22
+ | S3 | 1.49M lines | 2 | 28.46 | bf16, cosine LR |
23
+ | S4 | 87M words | 3 | 35.42 | grouped context |
24
 
25
  ## Usage
26
 
27
  ```python
28
+ from transformers import RobertaForMaskedLM
29
+ import sentencepiece as spm, torch
 
30
  import torch.nn.functional as F
31
  from huggingface_hub import hf_hub_download
32
 
33
+ REPO = "hellosindh/sindhi-bert-base"
34
  MASK_ID = 32000
35
  BOS_ID = 2
36
  EOS_ID = 3
37
 
38
+ model = RobertaForMaskedLM.from_pretrained(REPO)
39
+ sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
40
+ sp = spm.SentencePieceProcessor()
41
+ sp.Load(sp_path)
42
  ```
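The added Usage block loads the model and tokenizer but stops before making a prediction, even though it defines `MASK_ID`, `BOS_ID`, and `EOS_ID`. Below is a minimal fill-mask sketch adapted from the `fill_mask` helper that the previous version of this README carried; it assumes the variables loaded in the block above (`model`, `sp`) and the special-token IDs stated there, and adds a `model.eval()` call for inference.

```python
model.eval()  # disable dropout for inference

def fill_mask(sentence, top_k=5):
    # Encode the text on each side of the [MASK] placeholder with the SentencePiece tokenizer.
    left, right = sentence.split('[MASK]')
    left_ids = sp.EncodeAsIds(left.strip())
    right_ids = sp.EncodeAsIds(right.strip())
    input_ids = [BOS_ID] + left_ids + [MASK_ID] + right_ids + [EOS_ID]
    mask_pos = len(left_ids) + 1  # index of the [MASK] token (BOS sits at position 0)
    with torch.no_grad():
        logits = model(torch.tensor([input_ids])).logits[0, mask_pos]
    logits[MASK_ID] = -float('inf')            # never propose the mask token itself
    probs = F.softmax(logits[:32000], dim=-1)  # keep only the 32K SentencePiece vocabulary
    top_probs, top_ids = torch.topk(probs, top_k)
    for prob, idx in zip(top_probs, top_ids):
        word = sp.IdToPiece(idx.item()).replace('▁', '')
        print(f'{word:<20} {prob.item()*100:.2f}%')

fill_mask('سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي')
```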
 
checkpoint-3924/config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "add_cross_attention": false,
3
+ "architectures": [
4
+ "RobertaForMaskedLM"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 1,
8
+ "classifier_dropout": null,
9
+ "dtype": "float32",
10
+ "eos_token_id": 2,
11
+ "hidden_act": "gelu",
12
+ "hidden_dropout_prob": 0.1,
13
+ "hidden_size": 768,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 3072,
16
+ "is_decoder": false,
17
+ "layer_norm_eps": 1e-12,
18
+ "max_position_embeddings": 514,
19
+ "model_type": "roberta",
20
+ "num_attention_heads": 12,
21
+ "num_hidden_layers": 12,
22
+ "pad_token_id": 0,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "5.0.0",
25
+ "type_vocab_size": 1,
26
+ "use_cache": false,
27
+ "vocab_size": 32001
28
+ }
checkpoint-3924/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66c9b40b4d1b2943a622be928e3f8beb231f2cf80d2acbe19352c740edfa76b9
3
+ size 442633860
checkpoint-3924/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8cdbb31b8e427b2d5c5d5dce127c362cb391d70f8282995b2a405651b6695774
3
+ size 885391563
checkpoint-3924/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:35f5af9b38d87cb532b16dd4de5175c2910bc86cf1976c6ccc3668da1c53606d
3
+ size 14645
checkpoint-3924/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8eef8b1a8fe3ca13b13452c68d049d5772a114b25d47fd7c271209bdd37c174b
3
+ size 1465
checkpoint-3924/trainer_state.json ADDED
@@ -0,0 +1,332 @@
1
+ {
2
+ "best_global_step": 3924,
3
+ "best_metric": 3.56946063041687,
4
+ "best_model_checkpoint": "sindhibert_session4/checkpoint-3924",
5
+ "epoch": 2.0,
6
+ "eval_steps": 1962,
7
+ "global_step": 3924,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.05098139179199592,
14
+ "grad_norm": 4.590001106262207,
15
+ "learning_rate": 5.609065155807366e-06,
16
+ "loss": 15.86372314453125,
17
+ "step": 100
18
+ },
19
+ {
20
+ "epoch": 0.10196278358399184,
21
+ "grad_norm": 5.000253677368164,
22
+ "learning_rate": 1.1274787535410765e-05,
23
+ "loss": 15.6683056640625,
24
+ "step": 200
25
+ },
26
+ {
27
+ "epoch": 0.15294417537598776,
28
+ "grad_norm": 5.164661407470703,
29
+ "learning_rate": 1.6940509915014164e-05,
30
+ "loss": 15.58547607421875,
31
+ "step": 300
32
+ },
33
+ {
34
+ "epoch": 0.20392556716798368,
35
+ "grad_norm": 4.895200729370117,
36
+ "learning_rate": 1.999658933249201e-05,
37
+ "loss": 15.5261376953125,
38
+ "step": 400
39
+ },
40
+ {
41
+ "epoch": 0.2549069589599796,
42
+ "grad_norm": 5.010247707366943,
43
+ "learning_rate": 1.9965659596003744e-05,
44
+ "loss": 15.493291015625,
45
+ "step": 500
46
+ },
47
+ {
48
+ "epoch": 0.3058883507519755,
49
+ "grad_norm": 4.85853910446167,
50
+ "learning_rate": 1.990261043359342e-05,
51
+ "loss": 15.43971435546875,
52
+ "step": 600
53
+ },
54
+ {
55
+ "epoch": 0.35686974254397147,
56
+ "grad_norm": 4.788653373718262,
57
+ "learning_rate": 1.9807645053376055e-05,
58
+ "loss": 15.409666748046876,
59
+ "step": 700
60
+ },
61
+ {
62
+ "epoch": 0.40785113433596737,
63
+ "grad_norm": 4.742185592651367,
64
+ "learning_rate": 1.968106952977309e-05,
65
+ "loss": 15.346304931640624,
66
+ "step": 800
67
+ },
68
+ {
69
+ "epoch": 0.45883252612796327,
70
+ "grad_norm": 4.758422374725342,
71
+ "learning_rate": 1.9523291817031276e-05,
72
+ "loss": 15.344024658203125,
73
+ "step": 900
74
+ },
75
+ {
76
+ "epoch": 0.5098139179199592,
77
+ "grad_norm": 4.854381084442139,
78
+ "learning_rate": 1.933482043438185e-05,
79
+ "loss": 15.307811279296875,
80
+ "step": 1000
81
+ },
82
+ {
83
+ "epoch": 0.5607953097119551,
84
+ "grad_norm": 4.7934041023254395,
85
+ "learning_rate": 1.9116262827077703e-05,
86
+ "loss": 15.254422607421875,
87
+ "step": 1100
88
+ },
89
+ {
90
+ "epoch": 0.611776701503951,
91
+ "grad_norm": 4.670731544494629,
92
+ "learning_rate": 1.88683234085909e-05,
93
+ "loss": 15.23345703125,
94
+ "step": 1200
95
+ },
96
+ {
97
+ "epoch": 0.6627580932959469,
98
+ "grad_norm": 4.993561267852783,
99
+ "learning_rate": 1.8591801290280664e-05,
100
+ "loss": 15.2450927734375,
101
+ "step": 1300
102
+ },
103
+ {
104
+ "epoch": 0.7137394850879429,
105
+ "grad_norm": 4.720964431762695,
106
+ "learning_rate": 1.8287587705849013e-05,
107
+ "loss": 15.1839599609375,
108
+ "step": 1400
109
+ },
110
+ {
111
+ "epoch": 0.7647208768799388,
112
+ "grad_norm": 5.050419330596924,
113
+ "learning_rate": 1.7956663138885173e-05,
114
+ "loss": 15.164833984375,
115
+ "step": 1500
116
+ },
117
+ {
118
+ "epoch": 0.8157022686719347,
119
+ "grad_norm": 4.826648712158203,
120
+ "learning_rate": 1.760009416275661e-05,
121
+ "loss": 15.130496826171875,
122
+ "step": 1600
123
+ },
124
+ {
125
+ "epoch": 0.8666836604639306,
126
+ "grad_norm": 4.858438014984131,
127
+ "learning_rate": 1.721903000303185e-05,
128
+ "loss": 15.125797119140625,
129
+ "step": 1700
130
+ },
131
+ {
132
+ "epoch": 0.9176650522559265,
133
+ "grad_norm": 4.9611430168151855,
134
+ "learning_rate": 1.6814698833514326e-05,
135
+ "loss": 15.13617431640625,
136
+ "step": 1800
137
+ },
138
+ {
139
+ "epoch": 0.9686464440479226,
140
+ "grad_norm": 4.663859844207764,
141
+ "learning_rate": 1.63884038178253e-05,
142
+ "loss": 15.072591552734375,
143
+ "step": 1900
144
+ },
145
+ {
146
+ "epoch": 1.0,
147
+ "eval_loss": 3.636704444885254,
148
+ "eval_runtime": 8.0138,
149
+ "eval_samples_per_second": 632.91,
150
+ "eval_steps_per_second": 9.983,
151
+ "step": 1962
152
+ },
153
+ {
154
+ "epoch": 1.0193729288809585,
155
+ "grad_norm": 4.863068103790283,
156
+ "learning_rate": 1.5941518909293737e-05,
157
+ "loss": 14.968798828125,
158
+ "step": 2000
159
+ },
160
+ {
161
+ "epoch": 1.0703543206729544,
162
+ "grad_norm": 5.036495685577393,
163
+ "learning_rate": 1.5475484422690282e-05,
164
+ "loss": 15.0290869140625,
165
+ "step": 2100
166
+ },
167
+ {
168
+ "epoch": 1.1213357124649503,
169
+ "grad_norm": 5.248174667358398,
170
+ "learning_rate": 1.4991802392077543e-05,
171
+ "loss": 15.004036865234376,
172
+ "step": 2200
173
+ },
174
+ {
175
+ "epoch": 1.1723171042569462,
176
+ "grad_norm": 4.950564384460449,
177
+ "learning_rate": 1.4492031729738489e-05,
178
+ "loss": 15.002611083984375,
179
+ "step": 2300
180
+ },
181
+ {
182
+ "epoch": 1.2232984960489421,
183
+ "grad_norm": 4.509192943572998,
184
+ "learning_rate": 1.3977783201785732e-05,
185
+ "loss": 14.96060302734375,
186
+ "step": 2400
187
+ },
188
+ {
189
+ "epoch": 1.274279887840938,
190
+ "grad_norm": 4.900182723999023,
191
+ "learning_rate": 1.3450714236645352e-05,
192
+ "loss": 14.971297607421874,
193
+ "step": 2500
194
+ },
195
+ {
196
+ "epoch": 1.325261279632934,
197
+ "grad_norm": 5.138764381408691,
198
+ "learning_rate": 1.2912523583147625e-05,
199
+ "loss": 14.928385009765625,
200
+ "step": 2600
201
+ },
202
+ {
203
+ "epoch": 1.3762426714249298,
204
+ "grad_norm": 4.894199848175049,
205
+ "learning_rate": 1.2364945835441636e-05,
206
+ "loss": 14.938167724609375,
207
+ "step": 2700
208
+ },
209
+ {
210
+ "epoch": 1.4272240632169257,
211
+ "grad_norm": 4.8737921714782715,
212
+ "learning_rate": 1.1809745842380042e-05,
213
+ "loss": 14.923902587890625,
214
+ "step": 2800
215
+ },
216
+ {
217
+ "epoch": 1.4782054550089216,
218
+ "grad_norm": 4.8258819580078125,
219
+ "learning_rate": 1.1248713019392635e-05,
220
+ "loss": 14.89677001953125,
221
+ "step": 2900
222
+ },
223
+ {
224
+ "epoch": 1.5291868468009175,
225
+ "grad_norm": 4.769787788391113,
226
+ "learning_rate": 1.0683655581181524e-05,
227
+ "loss": 14.87692626953125,
228
+ "step": 3000
229
+ },
230
+ {
231
+ "epoch": 1.5801682385929134,
232
+ "grad_norm": 4.92316198348999,
233
+ "learning_rate": 1.0116394713826117e-05,
234
+ "loss": 14.849693603515625,
235
+ "step": 3100
236
+ },
237
+ {
238
+ "epoch": 1.6311496303849093,
239
+ "grad_norm": 4.873258590698242,
240
+ "learning_rate": 9.548758705081177e-06,
241
+ "loss": 14.833634033203126,
242
+ "step": 3200
243
+ },
244
+ {
245
+ "epoch": 1.6821310221769055,
246
+ "grad_norm": 4.738825798034668,
247
+ "learning_rate": 8.98257705178612e-06,
248
+ "loss": 14.85665283203125,
249
+ "step": 3300
250
+ },
251
+ {
252
+ "epoch": 1.7331124139689014,
253
+ "grad_norm": 4.907736778259277,
254
+ "learning_rate": 8.419674563377416e-06,
255
+ "loss": 14.8664599609375,
256
+ "step": 3400
257
+ },
258
+ {
259
+ "epoch": 1.7840938057608973,
260
+ "grad_norm": 4.977413177490234,
261
+ "learning_rate": 7.861865480508541e-06,
262
+ "loss": 14.83008056640625,
263
+ "step": 3500
264
+ },
265
+ {
266
+ "epoch": 1.8350751975528932,
267
+ "grad_norm": 4.792273044586182,
268
+ "learning_rate": 7.310947627733231e-06,
269
+ "loss": 14.81404541015625,
270
+ "step": 3600
271
+ },
272
+ {
273
+ "epoch": 1.886056589344889,
274
+ "grad_norm": 4.84648323059082,
275
+ "learning_rate": 6.768696619097996e-06,
276
+ "loss": 14.831793212890625,
277
+ "step": 3700
278
+ },
279
+ {
280
+ "epoch": 1.9370379811368852,
281
+ "grad_norm": 4.854404449462891,
282
+ "learning_rate": 6.236860135319321e-06,
283
+ "loss": 14.826976318359375,
284
+ "step": 3800
285
+ },
286
+ {
287
+ "epoch": 1.988019372928881,
288
+ "grad_norm": 4.615888595581055,
289
+ "learning_rate": 5.717152290990302e-06,
290
+ "loss": 14.767562255859374,
291
+ "step": 3900
292
+ },
293
+ {
294
+ "epoch": 2.0,
295
+ "eval_loss": 3.56946063041687,
296
+ "eval_runtime": 8.0481,
297
+ "eval_samples_per_second": 630.208,
298
+ "eval_steps_per_second": 9.94,
299
+ "step": 3924
300
+ }
301
+ ],
302
+ "logging_steps": 100,
303
+ "max_steps": 5886,
304
+ "num_input_tokens_seen": 0,
305
+ "num_train_epochs": 3,
306
+ "save_steps": 1962,
307
+ "stateful_callbacks": {
308
+ "EarlyStoppingCallback": {
309
+ "args": {
310
+ "early_stopping_patience": 3,
311
+ "early_stopping_threshold": 0.0
312
+ },
313
+ "attributes": {
314
+ "early_stopping_patience_counter": 0
315
+ }
316
+ },
317
+ "TrainerControl": {
318
+ "args": {
319
+ "should_epoch_stop": false,
320
+ "should_evaluate": false,
321
+ "should_log": false,
322
+ "should_save": true,
323
+ "should_training_stop": false
324
+ },
325
+ "attributes": {}
326
+ }
327
+ },
328
+ "total_flos": 2.643322074019246e+17,
329
+ "train_batch_size": 64,
330
+ "trial_name": null,
331
+ "trial_params": null
332
+ }
checkpoint-3924/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
3
+ size 5137
checkpoint-5886/config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "add_cross_attention": false,
3
+ "architectures": [
4
+ "RobertaForMaskedLM"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 1,
8
+ "classifier_dropout": null,
9
+ "dtype": "float32",
10
+ "eos_token_id": 2,
11
+ "hidden_act": "gelu",
12
+ "hidden_dropout_prob": 0.1,
13
+ "hidden_size": 768,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 3072,
16
+ "is_decoder": false,
17
+ "layer_norm_eps": 1e-12,
18
+ "max_position_embeddings": 514,
19
+ "model_type": "roberta",
20
+ "num_attention_heads": 12,
21
+ "num_hidden_layers": 12,
22
+ "pad_token_id": 0,
23
+ "tie_word_embeddings": true,
24
+ "transformers_version": "5.0.0",
25
+ "type_vocab_size": 1,
26
+ "use_cache": false,
27
+ "vocab_size": 32001
28
+ }
checkpoint-5886/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:170894fbff2599922589dc645dfc871455543fe1f1fa33d3381f8353cf0b2a5b
3
+ size 442633860
checkpoint-5886/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:49f438e933e34f365171b080043f51c3931028fb9b12b84462700e4fec8ed022
3
+ size 885391563
checkpoint-5886/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5b568051719bceb1b41126c825c8846c1625bce2c01817c9c4450273020cfb29
3
+ size 14645
checkpoint-5886/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8e0fb9255f3eabc9bbca3c948e4f71fe410e407e554e26add1a06864fa8f902
3
+ size 1465
checkpoint-5886/trainer_state.json ADDED
@@ -0,0 +1,473 @@
1
+ {
2
+ "best_global_step": 5886,
3
+ "best_metric": 3.5591108798980713,
4
+ "best_model_checkpoint": "sindhibert_session4/checkpoint-5886",
5
+ "epoch": 3.0,
6
+ "eval_steps": 1962,
7
+ "global_step": 5886,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.05098139179199592,
14
+ "grad_norm": 4.590001106262207,
15
+ "learning_rate": 5.609065155807366e-06,
16
+ "loss": 15.86372314453125,
17
+ "step": 100
18
+ },
19
+ {
20
+ "epoch": 0.10196278358399184,
21
+ "grad_norm": 5.000253677368164,
22
+ "learning_rate": 1.1274787535410765e-05,
23
+ "loss": 15.6683056640625,
24
+ "step": 200
25
+ },
26
+ {
27
+ "epoch": 0.15294417537598776,
28
+ "grad_norm": 5.164661407470703,
29
+ "learning_rate": 1.6940509915014164e-05,
30
+ "loss": 15.58547607421875,
31
+ "step": 300
32
+ },
33
+ {
34
+ "epoch": 0.20392556716798368,
35
+ "grad_norm": 4.895200729370117,
36
+ "learning_rate": 1.999658933249201e-05,
37
+ "loss": 15.5261376953125,
38
+ "step": 400
39
+ },
40
+ {
41
+ "epoch": 0.2549069589599796,
42
+ "grad_norm": 5.010247707366943,
43
+ "learning_rate": 1.9965659596003744e-05,
44
+ "loss": 15.493291015625,
45
+ "step": 500
46
+ },
47
+ {
48
+ "epoch": 0.3058883507519755,
49
+ "grad_norm": 4.85853910446167,
50
+ "learning_rate": 1.990261043359342e-05,
51
+ "loss": 15.43971435546875,
52
+ "step": 600
53
+ },
54
+ {
55
+ "epoch": 0.35686974254397147,
56
+ "grad_norm": 4.788653373718262,
57
+ "learning_rate": 1.9807645053376055e-05,
58
+ "loss": 15.409666748046876,
59
+ "step": 700
60
+ },
61
+ {
62
+ "epoch": 0.40785113433596737,
63
+ "grad_norm": 4.742185592651367,
64
+ "learning_rate": 1.968106952977309e-05,
65
+ "loss": 15.346304931640624,
66
+ "step": 800
67
+ },
68
+ {
69
+ "epoch": 0.45883252612796327,
70
+ "grad_norm": 4.758422374725342,
71
+ "learning_rate": 1.9523291817031276e-05,
72
+ "loss": 15.344024658203125,
73
+ "step": 900
74
+ },
75
+ {
76
+ "epoch": 0.5098139179199592,
77
+ "grad_norm": 4.854381084442139,
78
+ "learning_rate": 1.933482043438185e-05,
79
+ "loss": 15.307811279296875,
80
+ "step": 1000
81
+ },
82
+ {
83
+ "epoch": 0.5607953097119551,
84
+ "grad_norm": 4.7934041023254395,
85
+ "learning_rate": 1.9116262827077703e-05,
86
+ "loss": 15.254422607421875,
87
+ "step": 1100
88
+ },
89
+ {
90
+ "epoch": 0.611776701503951,
91
+ "grad_norm": 4.670731544494629,
92
+ "learning_rate": 1.88683234085909e-05,
93
+ "loss": 15.23345703125,
94
+ "step": 1200
95
+ },
96
+ {
97
+ "epoch": 0.6627580932959469,
98
+ "grad_norm": 4.993561267852783,
99
+ "learning_rate": 1.8591801290280664e-05,
100
+ "loss": 15.2450927734375,
101
+ "step": 1300
102
+ },
103
+ {
104
+ "epoch": 0.7137394850879429,
105
+ "grad_norm": 4.720964431762695,
106
+ "learning_rate": 1.8287587705849013e-05,
107
+ "loss": 15.1839599609375,
108
+ "step": 1400
109
+ },
110
+ {
111
+ "epoch": 0.7647208768799388,
112
+ "grad_norm": 5.050419330596924,
113
+ "learning_rate": 1.7956663138885173e-05,
114
+ "loss": 15.164833984375,
115
+ "step": 1500
116
+ },
117
+ {
118
+ "epoch": 0.8157022686719347,
119
+ "grad_norm": 4.826648712158203,
120
+ "learning_rate": 1.760009416275661e-05,
121
+ "loss": 15.130496826171875,
122
+ "step": 1600
123
+ },
124
+ {
125
+ "epoch": 0.8666836604639306,
126
+ "grad_norm": 4.858438014984131,
127
+ "learning_rate": 1.721903000303185e-05,
128
+ "loss": 15.125797119140625,
129
+ "step": 1700
130
+ },
131
+ {
132
+ "epoch": 0.9176650522559265,
133
+ "grad_norm": 4.9611430168151855,
134
+ "learning_rate": 1.6814698833514326e-05,
135
+ "loss": 15.13617431640625,
136
+ "step": 1800
137
+ },
138
+ {
139
+ "epoch": 0.9686464440479226,
140
+ "grad_norm": 4.663859844207764,
141
+ "learning_rate": 1.63884038178253e-05,
142
+ "loss": 15.072591552734375,
143
+ "step": 1900
144
+ },
145
+ {
146
+ "epoch": 1.0,
147
+ "eval_loss": 3.636704444885254,
148
+ "eval_runtime": 8.0138,
149
+ "eval_samples_per_second": 632.91,
150
+ "eval_steps_per_second": 9.983,
151
+ "step": 1962
152
+ },
153
+ {
154
+ "epoch": 1.0193729288809585,
155
+ "grad_norm": 4.863068103790283,
156
+ "learning_rate": 1.5941518909293737e-05,
157
+ "loss": 14.968798828125,
158
+ "step": 2000
159
+ },
160
+ {
161
+ "epoch": 1.0703543206729544,
162
+ "grad_norm": 5.036495685577393,
163
+ "learning_rate": 1.5475484422690282e-05,
164
+ "loss": 15.0290869140625,
165
+ "step": 2100
166
+ },
167
+ {
168
+ "epoch": 1.1213357124649503,
169
+ "grad_norm": 5.248174667358398,
170
+ "learning_rate": 1.4991802392077543e-05,
171
+ "loss": 15.004036865234376,
172
+ "step": 2200
173
+ },
174
+ {
175
+ "epoch": 1.1723171042569462,
176
+ "grad_norm": 4.950564384460449,
177
+ "learning_rate": 1.4492031729738489e-05,
178
+ "loss": 15.002611083984375,
179
+ "step": 2300
180
+ },
181
+ {
182
+ "epoch": 1.2232984960489421,
183
+ "grad_norm": 4.509192943572998,
184
+ "learning_rate": 1.3977783201785732e-05,
185
+ "loss": 14.96060302734375,
186
+ "step": 2400
187
+ },
188
+ {
189
+ "epoch": 1.274279887840938,
190
+ "grad_norm": 4.900182723999023,
191
+ "learning_rate": 1.3450714236645352e-05,
192
+ "loss": 14.971297607421874,
193
+ "step": 2500
194
+ },
195
+ {
196
+ "epoch": 1.325261279632934,
197
+ "grad_norm": 5.138764381408691,
198
+ "learning_rate": 1.2912523583147625e-05,
199
+ "loss": 14.928385009765625,
200
+ "step": 2600
201
+ },
202
+ {
203
+ "epoch": 1.3762426714249298,
204
+ "grad_norm": 4.894199848175049,
205
+ "learning_rate": 1.2364945835441636e-05,
206
+ "loss": 14.938167724609375,
207
+ "step": 2700
208
+ },
209
+ {
210
+ "epoch": 1.4272240632169257,
211
+ "grad_norm": 4.8737921714782715,
212
+ "learning_rate": 1.1809745842380042e-05,
213
+ "loss": 14.923902587890625,
214
+ "step": 2800
215
+ },
216
+ {
217
+ "epoch": 1.4782054550089216,
218
+ "grad_norm": 4.8258819580078125,
219
+ "learning_rate": 1.1248713019392635e-05,
220
+ "loss": 14.89677001953125,
221
+ "step": 2900
222
+ },
223
+ {
224
+ "epoch": 1.5291868468009175,
225
+ "grad_norm": 4.769787788391113,
226
+ "learning_rate": 1.0683655581181524e-05,
227
+ "loss": 14.87692626953125,
228
+ "step": 3000
229
+ },
230
+ {
231
+ "epoch": 1.5801682385929134,
232
+ "grad_norm": 4.92316198348999,
233
+ "learning_rate": 1.0116394713826117e-05,
234
+ "loss": 14.849693603515625,
235
+ "step": 3100
236
+ },
237
+ {
238
+ "epoch": 1.6311496303849093,
239
+ "grad_norm": 4.873258590698242,
240
+ "learning_rate": 9.548758705081177e-06,
241
+ "loss": 14.833634033203126,
242
+ "step": 3200
243
+ },
244
+ {
245
+ "epoch": 1.6821310221769055,
246
+ "grad_norm": 4.738825798034668,
247
+ "learning_rate": 8.98257705178612e-06,
248
+ "loss": 14.85665283203125,
249
+ "step": 3300
250
+ },
251
+ {
252
+ "epoch": 1.7331124139689014,
253
+ "grad_norm": 4.907736778259277,
254
+ "learning_rate": 8.419674563377416e-06,
255
+ "loss": 14.8664599609375,
256
+ "step": 3400
257
+ },
258
+ {
259
+ "epoch": 1.7840938057608973,
260
+ "grad_norm": 4.977413177490234,
261
+ "learning_rate": 7.861865480508541e-06,
262
+ "loss": 14.83008056640625,
263
+ "step": 3500
264
+ },
265
+ {
266
+ "epoch": 1.8350751975528932,
267
+ "grad_norm": 4.792273044586182,
268
+ "learning_rate": 7.310947627733231e-06,
269
+ "loss": 14.81404541015625,
270
+ "step": 3600
271
+ },
272
+ {
273
+ "epoch": 1.886056589344889,
274
+ "grad_norm": 4.84648323059082,
275
+ "learning_rate": 6.768696619097996e-06,
276
+ "loss": 14.831793212890625,
277
+ "step": 3700
278
+ },
279
+ {
280
+ "epoch": 1.9370379811368852,
281
+ "grad_norm": 4.854404449462891,
282
+ "learning_rate": 6.236860135319321e-06,
283
+ "loss": 14.826976318359375,
284
+ "step": 3800
285
+ },
286
+ {
287
+ "epoch": 1.988019372928881,
288
+ "grad_norm": 4.615888595581055,
289
+ "learning_rate": 5.717152290990302e-06,
290
+ "loss": 14.767562255859374,
291
+ "step": 3900
292
+ },
293
+ {
294
+ "epoch": 2.0,
295
+ "eval_loss": 3.56946063041687,
296
+ "eval_runtime": 8.0481,
297
+ "eval_samples_per_second": 630.208,
298
+ "eval_steps_per_second": 9.94,
299
+ "step": 3924
300
+ },
301
+ {
302
+ "epoch": 2.038745857761917,
303
+ "grad_norm": 5.015805721282959,
304
+ "learning_rate": 5.211248109971254e-06,
305
+ "loss": 14.695634765625,
306
+ "step": 4000
307
+ },
308
+ {
309
+ "epoch": 2.089727249553913,
310
+ "grad_norm": 4.800245761871338,
311
+ "learning_rate": 4.720778126770141e-06,
312
+ "loss": 14.764068603515625,
313
+ "step": 4100
314
+ },
315
+ {
316
+ "epoch": 2.140708641345909,
317
+ "grad_norm": 4.756154537200928,
318
+ "learning_rate": 4.247323131312676e-06,
319
+ "loss": 14.755054931640625,
320
+ "step": 4200
321
+ },
322
+ {
323
+ "epoch": 2.191690033137905,
324
+ "grad_norm": 4.989803314208984,
325
+ "learning_rate": 3.7924090740397178e-06,
326
+ "loss": 14.760721435546875,
327
+ "step": 4300
328
+ },
329
+ {
330
+ "epoch": 2.2426714249299007,
331
+ "grad_norm": 4.568801403045654,
332
+ "learning_rate": 3.3575021477529313e-06,
333
+ "loss": 14.72455810546875,
334
+ "step": 4400
335
+ },
336
+ {
337
+ "epoch": 2.2936528167218966,
338
+ "grad_norm": 4.871072769165039,
339
+ "learning_rate": 2.944004062059924e-06,
340
+ "loss": 14.743800048828126,
341
+ "step": 4500
342
+ },
343
+ {
344
+ "epoch": 2.3446342085138925,
345
+ "grad_norm": 4.790256500244141,
346
+ "learning_rate": 2.5532475256494073e-06,
347
+ "loss": 14.7241162109375,
348
+ "step": 4600
349
+ },
350
+ {
351
+ "epoch": 2.3956156003058884,
352
+ "grad_norm": 4.770144462585449,
353
+ "learning_rate": 2.186491950957048e-06,
354
+ "loss": 14.711162109375,
355
+ "step": 4700
356
+ },
357
+ {
358
+ "epoch": 2.4465969920978843,
359
+ "grad_norm": 4.44427490234375,
360
+ "learning_rate": 1.8449193950659018e-06,
361
+ "loss": 14.72890625,
362
+ "step": 4800
363
+ },
364
+ {
365
+ "epoch": 2.49757838388988,
366
+ "grad_norm": 4.664465427398682,
367
+ "learning_rate": 1.5296307499239903e-06,
368
+ "loss": 14.713804931640626,
369
+ "step": 4900
370
+ },
371
+ {
372
+ "epoch": 2.548559775681876,
373
+ "grad_norm": 4.861291408538818,
374
+ "learning_rate": 1.2416421941579448e-06,
375
+ "loss": 14.730694580078126,
376
+ "step": 5000
377
+ },
378
+ {
379
+ "epoch": 2.599541167473872,
380
+ "grad_norm": 4.662012577056885,
381
+ "learning_rate": 9.818819179185713e-07,
382
+ "loss": 14.70477294921875,
383
+ "step": 5100
384
+ },
385
+ {
386
+ "epoch": 2.650522559265868,
387
+ "grad_norm": 4.803001403808594,
388
+ "learning_rate": 7.511871313142238e-07,
389
+ "loss": 14.7314208984375,
390
+ "step": 5200
391
+ },
392
+ {
393
+ "epoch": 2.701503951057864,
394
+ "grad_norm": 4.746646404266357,
395
+ "learning_rate": 5.503013660737899e-07,
396
+ "loss": 14.70580810546875,
397
+ "step": 5300
398
+ },
399
+ {
400
+ "epoch": 2.7524853428498597,
401
+ "grad_norm": 4.867108345031738,
402
+ "learning_rate": 3.798720791360988e-07,
403
+ "loss": 14.710306396484375,
404
+ "step": 5400
405
+ },
406
+ {
407
+ "epoch": 2.8034667346418556,
408
+ "grad_norm": 4.6949992179870605,
409
+ "learning_rate": 2.404485658893807e-07,
410
+ "loss": 14.725491943359375,
411
+ "step": 5500
412
+ },
413
+ {
414
+ "epoch": 2.8544481264338515,
415
+ "grad_norm": 4.641607284545898,
416
+ "learning_rate": 1.3248018978643695e-07,
417
+ "loss": 14.7078369140625,
418
+ "step": 5600
419
+ },
420
+ {
421
+ "epoch": 2.905429518225848,
422
+ "grad_norm": 4.756202220916748,
423
+ "learning_rate": 5.6314934041501455e-08,
424
+ "loss": 14.697396240234376,
425
+ "step": 5700
426
+ },
427
+ {
428
+ "epoch": 2.9564109100178433,
429
+ "grad_norm": 4.691574573516846,
430
+ "learning_rate": 1.2198280076668455e-08,
431
+ "loss": 14.694278564453125,
432
+ "step": 5800
433
+ },
434
+ {
435
+ "epoch": 3.0,
436
+ "eval_loss": 3.5591108798980713,
437
+ "eval_runtime": 8.0338,
438
+ "eval_samples_per_second": 631.333,
439
+ "eval_steps_per_second": 9.958,
440
+ "step": 5886
441
+ }
442
+ ],
443
+ "logging_steps": 100,
444
+ "max_steps": 5886,
445
+ "num_input_tokens_seen": 0,
446
+ "num_train_epochs": 3,
447
+ "save_steps": 1962,
448
+ "stateful_callbacks": {
449
+ "EarlyStoppingCallback": {
450
+ "args": {
451
+ "early_stopping_patience": 3,
452
+ "early_stopping_threshold": 0.0
453
+ },
454
+ "attributes": {
455
+ "early_stopping_patience_counter": 0
456
+ }
457
+ },
458
+ "TrainerControl": {
459
+ "args": {
460
+ "should_epoch_stop": false,
461
+ "should_evaluate": false,
462
+ "should_log": false,
463
+ "should_save": true,
464
+ "should_training_stop": true
465
+ },
466
+ "attributes": {}
467
+ }
468
+ },
469
+ "total_flos": 3.964983111028869e+17,
470
+ "train_batch_size": 64,
471
+ "trial_name": null,
472
+ "trial_params": null
473
+ }
checkpoint-5886/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
3
+ size 5137
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:af548243f2a2884a4a369c6b04c497110cb9a587cea0a5041e9a0820c72889ef
3
  size 442633860
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:170894fbff2599922589dc645dfc871455543fe1f1fa33d3381f8353cf0b2a5b
3
  size 442633860
training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:dc38b2eea3f8755ab49032af3c555b4a3e9c23274e629dd4c763171401716a57
3
  size 5137
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
3
  size 5137
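A closing note on the checkpoint folders added in this commit: each one is a full RobertaForMaskedLM snapshot, and its trainer_state.json records the eval loss behind the perplexity column in the updated table (the earlier README paired eval loss 3.348446 with perplexity 28.46, i.e. perplexity = exp(eval loss)). The sketch below is illustrative, not something this commit documents: it assumes `from_pretrained` can reach the checkpoint folder via the `subfolder` argument, and it exponentiates the `best_metric` recorded for checkpoint-5886, which lands close to the S4 figure in the table.

```python
import math

from transformers import RobertaForMaskedLM

# Load one of the intermediate checkpoints added in this commit
# (the main revision loads without the `subfolder` argument).
ckpt = RobertaForMaskedLM.from_pretrained(
    "hellosindh/sindhi-bert-base", subfolder="checkpoint-5886"
)

# trainer_state.json reports best_metric (eval loss) 3.5591 for this checkpoint;
# perplexity is its exponential.
print(f"perplexity ≈ {math.exp(3.5591108798980713):.2f}")
```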