Upload folder using huggingface_hub

- README.md +14 -250
- checkpoint-3924/config.json +28 -0
- checkpoint-3924/model.safetensors +3 -0
- checkpoint-3924/optimizer.pt +3 -0
- checkpoint-3924/rng_state.pth +3 -0
- checkpoint-3924/scheduler.pt +3 -0
- checkpoint-3924/trainer_state.json +332 -0
- checkpoint-3924/training_args.bin +3 -0
- checkpoint-5886/config.json +28 -0
- checkpoint-5886/model.safetensors +3 -0
- checkpoint-5886/optimizer.pt +3 -0
- checkpoint-5886/rng_state.pth +3 -0
- checkpoint-5886/scheduler.pt +3 -0
- checkpoint-5886/trainer_state.json +473 -0
- checkpoint-5886/training_args.bin +3 -0
- model.safetensors +1 -1
- training_args.bin +1 -1
README.md
CHANGED
@@ -7,272 +7,36 @@ tags:
- bert
- masked-language-modeling
- from-scratch
- nlp
model-index:
- name: sindhi-bert-base
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - type: perplexity
      value: 28.46
      name: Perplexity (Session 3)
---

# Sindhi-BERT-base

---

## Training History

| Session | Data | Epochs |
|---|---|---|

---
## Model Details

| Detail | Value |
|---|---|
| Architecture | RoBERTa-base |
| Vocabulary | 32,000 tokens (pure Sindhi BPE) |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Max length | 512 tokens |
| Parameters | ~110M |
| Language | Sindhi (sd) |
| License | MIT |

---
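The ~110M figure above can be sanity-checked from the architecture numbers. This is a rough back-of-the-envelope sketch (it uses the 32,001-entry vocabulary from the checkpoint config, i.e. 32K BPE tokens plus the mask token, and a standard RoBERTa layout; it is an estimate, not a count taken from the repo):

```python
# Rough parameter count for a RoBERTa-base layout with this vocabulary.
V, H, L, I, P = 32001, 768, 12, 3072, 514

# Embeddings: token + position + token-type (size 1) + embedding LayerNorm
embed = V * H + P * H + H + 2 * H

# Per encoder layer: Q/K/V/O projections, FFN up+down, two LayerNorms
attn = 4 * H * H + 4 * H
ffn = 2 * H * I + I + H
per_layer = attn + ffn + 2 * (2 * H)

# MLM head: dense + LayerNorm + output bias (decoder weight tied to embeddings)
lm_head = H * H + H + 2 * H + V

total = embed + L * per_layer + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # ~111M
```

The small gap versus the stated ~110M is expected for an estimate like this.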
## Session 3 Training Details

| Detail | Value |
|---|---|
| Corpus size | 589 MB clean Sindhi text |
| Total words | ~74 million |
| Epochs | 2 |
| Batch size | 64 (effective 256) |
| Learning rate | 3e-5 |
| LR scheduler | Cosine decay |
| Warmup | 5% of total steps |
| Precision | bf16 (A100) |
| Gradient clipping | 1.5 |
| Hardware | H100 GPU |
| Training time | 224 minutes |
| Eval loss | 3.348446 |
| Perplexity | 28.46 |

---
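The perplexity row follows directly from the eval-loss row: for a masked-language model, perplexity is the exponential of the evaluation cross-entropy loss.

```python
import math

# Perplexity = exp(eval cross-entropy loss), using the value from the table above
eval_loss = 3.348446
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # 28.46
```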
## Fill-Mask Results — Session 3

### ✅ Correct Predictions (8/10)

**1. Language identification**
```
Input : سنڌي [MASK] دنيا جي قديم ٻولين مان ھڪ آھي
✅ Top 1 : ٻولي (language) — 40.90%
   Top 2 : ادب (literature) — 7.86%
   Top 3 : ٻوليءَ — 7.20%
```

**2. People context**
```
Input : پاڪستان ۾ سنڌي [MASK] گھڻي تعداد ۾ رھن ٿا
✅ Top 1 : ماڻهو (people) — 33.47%
   Top 2 : سنڌي — 2.65%
   Top 3 : ٻار (children) — 2.63%
```

**3. City identification**
```
Input : ڪراچي سنڌ جو سڀ کان وڏو [MASK] آھي
✅ Top 1 : شھر (city) — 16.72%
   Top 2 : حصو (part) — 7.02%
   Top 3 : ملڪ (country) — 4.06%
```

**4. Direction context**
```
Input : ھو پنھنجي [MASK] ڏانھن ويو
✅ Top 1 : گهر (home) — 11.67%
   Top 2 : ڳوٺ (village) — 6.63%
   Top 3 : منزل (destination) — 5.15%
```

**5. Poet identification**
```
Input : شاھه لطيف سنڌي [MASK] جو وڏو شاعر آھي
✅ Top 1 : شاعريءَ (poetry) — 25.77%
   Top 2 : ٻوليءَ (language) — 25.76%
   Top 3 : ادب (literature) — 13.00%
```

**6. History context**
```
Input : سنڌ جي [MASK] ڏاڍي پراڻي آھي
✅ Top 1 : تاريخ (history) — 16.04%
   Top 2 : ٻولي (language) — 3.88%
   Top 3 : ڌرتي (land) — 3.67%
```

**7. Grammar word**
```
Input : دنيا [MASK] گھڻي مصروف آھي
✅ Top 1 : ۾ (in) — 23.20%
   Top 2 : کي (to) — 17.54%
   Top 3 : جي (of) — 3.71%
```

**8. Education context (close)**
```
Input : استاد شاگردن کي [MASK] سيکاري ٿو
⚠️ Top 1 : استاد (teacher — repeats subject) — 15.87%
✅ Top 2 : تعليم (education) — 13.70%
   Top 3 : سبق (lesson) — 6.03%
```

---
### ❌ Incorrect Predictions (2/10)

**9. School context (wrong)**
```
Input : ٻار [MASK] ۾ پڙھن ٿا
❌ Top 1 : گهر (home) — 2.46% ← should be اسڪول (school)
   Top 2 : َ — 2.33% ← diacritic noise
   Top 3 : اکين (eyes) — 2.26%
Expected : اسڪول (school) ← model needs more school context data
```

**10. River context (close)**
```
Input : سنڌو [MASK] سنڌ جي سڀيتا جو مرڪز رھيو آھي
⚠️ Top 1 : سڀيتا (civilization) — 15.54% ← repeats next word
✅ Top 2 : ندي (river) — 7.19% ← correct answer
   Top 3 : ۽ (and) — 5.82%
Expected : ندي (river) ← correct but at Top 2
```

---
## Progress Across Sessions

| Sentence | Session 1 | Session 2 | Session 3 |
|---|---|---|---|
| سنڌي ___ دنيا جي | ✅ ٻولي 15% | ✅ ٻولي 22% | ✅ ٻولي **40.90%** |
| پاڪستان ۾ سنڌي ___ | ❌ | ✅ ماڻهو 49% | ✅ ماڻهو **33.47%** |
| ڪراچي سنڌ جو ___ | ✅ Top 3 | ✅ شھر 9% | ✅ شھر **16.72%** |
| ھو پنھنجي ___ ڏانھن | ⚠️ | ⚠️ | ✅ گهر **11.67%** |
| شاھه لطيف ___ | ✅ | ✅ | ✅ شاعريءَ **25.77%** |
| سنڌ جي ___ پراڻي | ✅ Top 2 | ✅ Top 1 | ✅ تاريخ **16.04%** |
| استاد ___ سيکاري | ✅ تعليم | ❌ استاد | ⚠️ Top 2 تعليم |
| ٻار ___ ۾ پڙھن | ❌ | ❌ | ❌ گهر |
| دنيا ___ مصروف | ✅ ۾ | ✅ ۾ 38% | ✅ ۾ **23.20%** |
| سنڌو ___ سنڌ جي | ❌ | ⚠️ Top 4 | ⚠️ Top 2 ندي |
| **Score** | **50%** | **70%** | **80%** |

---
## Tokenizer

Custom Sindhi BPE tokenizer — every Sindhi word stays as ONE token:

```
Input  : سنڌي ٻولي دنيا جي قديم ٻولين مان ھڪ آھي
Tokens : ['▁سنڌي', '▁ٻولي', '▁دنيا', '▁جي', '▁قديم', '▁ٻولين', '▁مان', '▁ھڪ', '▁آھي']
Count  : 9 words = 9 tokens ✅
```

Unlike mBERT or XLM-R, which split Sindhi words into multiple subword pieces, our tokenizer keeps each Sindhi word as a single token.

---
## Comparison With Other Models

| Model | Type | Perplexity | Fill-mask Quality |
|---|---|---|---|
| mBERT fine-tuned | Multilingual | 4.19 | ❌ Predicts punctuation |
| XLM-R fine-tuned | Multilingual | 5.88 | ✅ 80% correct |
| SindhiBERT Session 1 | Sindhi only | 78.10 | ✅ 50% |
| SindhiBERT Session 2 | Sindhi only | 41.62 | ✅ 70% |
| **SindhiBERT Session 3** | **Sindhi only** | **28.46** | **✅ 80%** |

> Note: mBERT/XLM-R perplexity is low because they start from pretrained multilingual weights. SindhiBERT starts from zero and learns pure Sindhi — its predictions are always real Sindhi words, never punctuation or non-Sindhi tokens.

---
## Usage

```python
from transformers import AutoModelForMaskedLM
import sentencepiece as spm
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

model = AutoModelForMaskedLM.from_pretrained('hellosindh/sindhi-bert-base')
model.eval()

# Load tokenizer
sp_path = hf_hub_download('hellosindh/sindhi-bert-base', 'sindhi_bpe_32k.model')
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)

# Constants
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3
VOCAB_SIZE = 32000

def fill_mask(sentence, top_k=5):
    parts = sentence.split('[MASK]')
    left_ids = sp.EncodeAsIds(parts[0].strip())
    right_ids = sp.EncodeAsIds(parts[1].strip())
    input_ids = [BOS_ID] + left_ids + [MASK_ID] + right_ids + [EOS_ID]
    mask_pos = len(left_ids) + 1  # position of [MASK] (one slot after BOS)
    tensor = torch.tensor([input_ids])
    with torch.no_grad():
        logits = model(tensor).logits[0, mask_pos]
    logits[MASK_ID] = -float('inf')  # never predict the mask token itself
    probs = F.softmax(logits[:VOCAB_SIZE], dim=-1)
    top_probs, top_ids = torch.topk(probs, top_k)
    for prob, idx in zip(top_probs, top_ids):
        word = sp.IdToPiece(idx.item()).replace('▁', '')
        print(f'{word:<20} {prob.item()*100:.2f}%')
```
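The softmax-then-top-k step inside `fill_mask` can be sanity-checked without downloading the model. This standard-library sketch (the helper name `softmax_topk` is ours, not part of the repo) mirrors that step on dummy logits:

```python
import math

def softmax_topk(logits, k):
    """Return the k highest-probability (index, prob) pairs for a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Dummy 5-token vocabulary: index 2 has the largest logit, so it should rank first
top = softmax_topk([0.1, 1.0, 3.0, 0.5, -2.0], k=2)
print(top[0][0])  # 2
```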
---

## Roadmap

- [x] Custom Sindhi BPE tokenizer (32K vocab)
- [x] Session 1 — 500K lines, 5 epochs, PPL 78.10
- [x] Session 2 — 1.5M lines, 3 epochs, PPL 41.62
- [x] Session 3 — 589MB clean corpus, 2 epochs, PPL 28.46
- [ ] Session 4 — more data + 3 epochs → target PPL ~18
- [ ] Session 5 — fine-tune lower LR → target PPL ~12
- [ ] Spell checker fine-tuning
- [ ] Next word prediction
- [ ] Named entity recognition
- [ ] Sindhi chatbot

---
## About

The corpus was carefully cleaned using a custom pipeline including Unicode normalization, script standardization, he-character normalization (ھ/ه/ہ), and word-level corrections using a 9,355-entry Sindhi dictionary.
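The he-character normalization step mentioned above can be illustrated with a toy sketch. The actual pipeline and its canonical target form are not published here, so the mapping direction below (folding ه and ہ into ھ) is purely an assumption for illustration:

```python
import unicodedata

# Toy he-normalization: fold Arabic heh (ه) and heh goal (ہ) into the
# Sindhi-conventional ھ. The real pipeline's choice of canonical form may differ.
HE_MAP = str.maketrans({
    "\u0647": "\u06BE",  # ه -> ھ (assumption)
    "\u06C1": "\u06BE",  # ہ -> ھ (assumption)
})

def normalize_sd(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # Unicode normalization first
    return text.translate(HE_MAP)
```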
---

# Sindhi-BERT-base

First BERT-style model trained from scratch on Sindhi text.

## Training History

| Session | Data | Epochs | PPL | Notes |
|---|---|---|---|---|
| S1 | 500K lines | 5 | 78.10 | from scratch |
| S2 | 1.5M lines | 3 | 41.62 | continued |
| S3 | 1.49M lines | 2 | 28.46 | bf16, cosine LR |
| S4 | 87M words | 3 | 35.42 | grouped context |
## Usage

```python
from transformers import RobertaForMaskedLM
import sentencepiece as spm
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

REPO = "hellosindh/sindhi-bert-base"
MASK_ID = 32000
BOS_ID = 2
EOS_ID = 3

model = RobertaForMaskedLM.from_pretrained(REPO)
sp_path = hf_hub_download(REPO, "sindhi_bpe_32k.model")
sp = spm.SentencePieceProcessor()
sp.Load(sp_path)
```
checkpoint-3924/config.json
ADDED
@@ -0,0 +1,28 @@
{
  "add_cross_attention": false,
  "architectures": ["RobertaForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 1,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "tie_word_embeddings": true,
  "transformers_version": "5.0.0",
  "type_vocab_size": 1,
  "use_cache": false,
  "vocab_size": 32001
}
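A couple of derived values can be read off the config above: the per-head dimension, and the fact that `max_position_embeddings` is 514 rather than 512 (RoBERTa reserves two position slots for its padding offset). A small sketch, with the relevant keys copied from the file:

```python
import json

# Relevant keys copied verbatim from checkpoint-3924/config.json
cfg = json.loads("""{
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 514
}""")

head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
usable_positions = cfg["max_position_embeddings"] - 2  # RoBERTa position offset

print(head_dim)          # 64
print(usable_positions)  # 512
```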
checkpoint-3924/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:66c9b40b4d1b2943a622be928e3f8beb231f2cf80d2acbe19352c740edfa76b9
size 442633860
checkpoint-3924/optimizer.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8cdbb31b8e427b2d5c5d5dce127c362cb391d70f8282995b2a405651b6695774
size 885391563
checkpoint-3924/rng_state.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:35f5af9b38d87cb532b16dd4de5175c2910bc86cf1976c6ccc3668da1c53606d
size 14645
checkpoint-3924/scheduler.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8eef8b1a8fe3ca13b13452c68d049d5772a114b25d47fd7c271209bdd37c174b
size 1465
checkpoint-3924/trainer_state.json
ADDED
@@ -0,0 +1,332 @@
{
  "best_global_step": 3924,
  "best_metric": 3.56946063041687,
  "best_model_checkpoint": "sindhibert_session4/checkpoint-3924",
  "epoch": 2.0,
  "eval_steps": 1962,
  "global_step": 3924,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {"epoch": 0.05098139179199592, "grad_norm": 4.590001106262207, "learning_rate": 5.609065155807366e-06, "loss": 15.86372314453125, "step": 100},
    {"epoch": 0.10196278358399184, "grad_norm": 5.000253677368164, "learning_rate": 1.1274787535410765e-05, "loss": 15.6683056640625, "step": 200},
    {"epoch": 0.15294417537598776, "grad_norm": 5.164661407470703, "learning_rate": 1.6940509915014164e-05, "loss": 15.58547607421875, "step": 300},
    {"epoch": 0.20392556716798368, "grad_norm": 4.895200729370117, "learning_rate": 1.999658933249201e-05, "loss": 15.5261376953125, "step": 400},
    {"epoch": 0.2549069589599796, "grad_norm": 5.010247707366943, "learning_rate": 1.9965659596003744e-05, "loss": 15.493291015625, "step": 500},
    {"epoch": 0.3058883507519755, "grad_norm": 4.85853910446167, "learning_rate": 1.990261043359342e-05, "loss": 15.43971435546875, "step": 600},
    {"epoch": 0.35686974254397147, "grad_norm": 4.788653373718262, "learning_rate": 1.9807645053376055e-05, "loss": 15.409666748046876, "step": 700},
    {"epoch": 0.40785113433596737, "grad_norm": 4.742185592651367, "learning_rate": 1.968106952977309e-05, "loss": 15.346304931640624, "step": 800},
    {"epoch": 0.45883252612796327, "grad_norm": 4.758422374725342, "learning_rate": 1.9523291817031276e-05, "loss": 15.344024658203125, "step": 900},
    {"epoch": 0.5098139179199592, "grad_norm": 4.854381084442139, "learning_rate": 1.933482043438185e-05, "loss": 15.307811279296875, "step": 1000},
    {"epoch": 0.5607953097119551, "grad_norm": 4.7934041023254395, "learning_rate": 1.9116262827077703e-05, "loss": 15.254422607421875, "step": 1100},
    {"epoch": 0.611776701503951, "grad_norm": 4.670731544494629, "learning_rate": 1.88683234085909e-05, "loss": 15.23345703125, "step": 1200},
    {"epoch": 0.6627580932959469, "grad_norm": 4.993561267852783, "learning_rate": 1.8591801290280664e-05, "loss": 15.2450927734375, "step": 1300},
    {"epoch": 0.7137394850879429, "grad_norm": 4.720964431762695, "learning_rate": 1.8287587705849013e-05, "loss": 15.1839599609375, "step": 1400},
    {"epoch": 0.7647208768799388, "grad_norm": 5.050419330596924, "learning_rate": 1.7956663138885173e-05, "loss": 15.164833984375, "step": 1500},
    {"epoch": 0.8157022686719347, "grad_norm": 4.826648712158203, "learning_rate": 1.760009416275661e-05, "loss": 15.130496826171875, "step": 1600},
    {"epoch": 0.8666836604639306, "grad_norm": 4.858438014984131, "learning_rate": 1.721903000303185e-05, "loss": 15.125797119140625, "step": 1700},
    {"epoch": 0.9176650522559265, "grad_norm": 4.9611430168151855, "learning_rate": 1.6814698833514326e-05, "loss": 15.13617431640625, "step": 1800},
    {"epoch": 0.9686464440479226, "grad_norm": 4.663859844207764, "learning_rate": 1.63884038178253e-05, "loss": 15.072591552734375, "step": 1900},
    {"epoch": 1.0, "eval_loss": 3.636704444885254, "eval_runtime": 8.0138, "eval_samples_per_second": 632.91, "eval_steps_per_second": 9.983, "step": 1962},
    {"epoch": 1.0193729288809585, "grad_norm": 4.863068103790283, "learning_rate": 1.5941518909293737e-05, "loss": 14.968798828125, "step": 2000},
    {"epoch": 1.0703543206729544, "grad_norm": 5.036495685577393, "learning_rate": 1.5475484422690282e-05, "loss": 15.0290869140625, "step": 2100},
    {"epoch": 1.1213357124649503, "grad_norm": 5.248174667358398, "learning_rate": 1.4991802392077543e-05, "loss": 15.004036865234376, "step": 2200},
    {"epoch": 1.1723171042569462, "grad_norm": 4.950564384460449, "learning_rate": 1.4492031729738489e-05, "loss": 15.002611083984375, "step": 2300},
    {"epoch": 1.2232984960489421, "grad_norm": 4.509192943572998, "learning_rate": 1.3977783201785732e-05, "loss": 14.96060302734375, "step": 2400},
    {"epoch": 1.274279887840938, "grad_norm": 4.900182723999023, "learning_rate": 1.3450714236645352e-05, "loss": 14.971297607421874, "step": 2500},
    {"epoch": 1.325261279632934, "grad_norm": 5.138764381408691, "learning_rate": 1.2912523583147625e-05, "loss": 14.928385009765625, "step": 2600},
    {"epoch": 1.3762426714249298, "grad_norm": 4.894199848175049, "learning_rate": 1.2364945835441636e-05, "loss": 14.938167724609375, "step": 2700},
    {"epoch": 1.4272240632169257, "grad_norm": 4.8737921714782715, "learning_rate": 1.1809745842380042e-05, "loss": 14.923902587890625, "step": 2800},
    {"epoch": 1.4782054550089216, "grad_norm": 4.8258819580078125, "learning_rate": 1.1248713019392635e-05, "loss": 14.89677001953125, "step": 2900},
    {"epoch": 1.5291868468009175, "grad_norm": 4.769787788391113, "learning_rate": 1.0683655581181524e-05, "loss": 14.87692626953125, "step": 3000},
    {"epoch": 1.5801682385929134, "grad_norm": 4.92316198348999, "learning_rate": 1.0116394713826117e-05, "loss": 14.849693603515625, "step": 3100},
    {"epoch": 1.6311496303849093, "grad_norm": 4.873258590698242, "learning_rate": 9.548758705081177e-06, "loss": 14.833634033203126, "step": 3200},
    {"epoch": 1.6821310221769055, "grad_norm": 4.738825798034668, "learning_rate": 8.98257705178612e-06, "loss": 14.85665283203125, "step": 3300},
    {"epoch": 1.7331124139689014, "grad_norm": 4.907736778259277, "learning_rate": 8.419674563377416e-06, "loss": 14.8664599609375, "step": 3400},
    {"epoch": 1.7840938057608973, "grad_norm": 4.977413177490234, "learning_rate": 7.861865480508541e-06, "loss": 14.83008056640625, "step": 3500},
    {"epoch": 1.8350751975528932, "grad_norm": 4.792273044586182, "learning_rate": 7.310947627733231e-06, "loss": 14.81404541015625, "step": 3600},
    {"epoch": 1.886056589344889, "grad_norm": 4.84648323059082, "learning_rate": 6.768696619097996e-06, "loss": 14.831793212890625, "step": 3700},
    {"epoch": 1.9370379811368852, "grad_norm": 4.854404449462891, "learning_rate": 6.236860135319321e-06, "loss": 14.826976318359375, "step": 3800},
    {"epoch": 1.988019372928881, "grad_norm": 4.615888595581055, "learning_rate": 5.717152290990302e-06, "loss": 14.767562255859374, "step": 3900},
    {"epoch": 2.0, "eval_loss": 3.56946063041687, "eval_runtime": 8.0481, "eval_samples_per_second": 630.208, "eval_steps_per_second": 9.94, "step": 3924}
  ],
  "logging_steps": 100,
  "max_steps": 5886,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 3,
  "save_steps": 1962,
  "stateful_callbacks": {
    "EarlyStoppingCallback": {
      "args": {"early_stopping_patience": 3, "early_stopping_threshold": 0.0},
      "attributes": {"early_stopping_patience_counter": 0}
    },
    "TrainerControl": {
      "args": {"should_epoch_stop": false, "should_evaluate": false, "should_log": false, "should_save": true, "should_training_stop": false},
      "attributes": {}
    }
  },
  "total_flos": 2.643322074019246e+17,
  "train_batch_size": 64,
  "trial_name": null,
  "trial_params": null
}
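The `EarlyStoppingCallback` settings recorded in this trainer_state.json (patience 3, threshold 0.0) implement a standard rule: stop when eval loss fails to improve for three consecutive evaluations. A minimal sketch of that rule (the helper `should_stop` is ours, not a Trainer API):

```python
def should_stop(eval_losses, patience=3, threshold=0.0):
    """Return True if eval loss went `patience` evals without improving by > threshold."""
    best = float("inf")
    counter = 0
    for loss in eval_losses:
        if loss < best - threshold:
            best = loss
            counter = 0  # improvement resets the patience counter
        else:
            counter += 1
        if counter >= patience:
            return True
    return False

print(should_stop([3.64, 3.57, 3.56]))    # False: still improving
print(should_stop([3.6, 3.7, 3.7, 3.7]))  # True: three evals without improvement
```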
checkpoint-3924/training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
size 5137
checkpoint-5886/config.json
ADDED
@@ -0,0 +1,28 @@
{
  "add_cross_attention": false,
  "architectures": ["RobertaForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 1,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "tie_word_embeddings": true,
  "transformers_version": "5.0.0",
  "type_vocab_size": 1,
  "use_cache": false,
  "vocab_size": 32001
}
checkpoint-5886/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:170894fbff2599922589dc645dfc871455543fe1f1fa33d3381f8353cf0b2a5b
size 442633860
checkpoint-5886/optimizer.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:49f438e933e34f365171b080043f51c3931028fb9b12b84462700e4fec8ed022
size 885391563
checkpoint-5886/rng_state.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5b568051719bceb1b41126c825c8846c1625bce2c01817c9c4450273020cfb29
size 14645
checkpoint-5886/scheduler.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a8e0fb9255f3eabc9bbca3c948e4f71fe410e407e554e26add1a06864fa8f902
size 1465
checkpoint-5886/trainer_state.json
ADDED
@@ -0,0 +1,473 @@
+{
+  "best_global_step": 5886,
+  "best_metric": 3.5591108798980713,
+  "best_model_checkpoint": "sindhibert_session4/checkpoint-5886",
+  "epoch": 3.0,
+  "eval_steps": 1962,
+  "global_step": 5886,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.05098139179199592,
+      "grad_norm": 4.590001106262207,
+      "learning_rate": 5.609065155807366e-06,
+      "loss": 15.86372314453125,
+      "step": 100
+    },
+    {
+      "epoch": 0.10196278358399184,
+      "grad_norm": 5.000253677368164,
+      "learning_rate": 1.1274787535410765e-05,
+      "loss": 15.6683056640625,
+      "step": 200
+    },
+    {
+      "epoch": 0.15294417537598776,
+      "grad_norm": 5.164661407470703,
+      "learning_rate": 1.6940509915014164e-05,
+      "loss": 15.58547607421875,
+      "step": 300
+    },
+    {
+      "epoch": 0.20392556716798368,
+      "grad_norm": 4.895200729370117,
+      "learning_rate": 1.999658933249201e-05,
+      "loss": 15.5261376953125,
+      "step": 400
+    },
+    {
+      "epoch": 0.2549069589599796,
+      "grad_norm": 5.010247707366943,
+      "learning_rate": 1.9965659596003744e-05,
+      "loss": 15.493291015625,
+      "step": 500
+    },
+    {
+      "epoch": 0.3058883507519755,
+      "grad_norm": 4.85853910446167,
+      "learning_rate": 1.990261043359342e-05,
+      "loss": 15.43971435546875,
+      "step": 600
+    },
+    {
+      "epoch": 0.35686974254397147,
+      "grad_norm": 4.788653373718262,
+      "learning_rate": 1.9807645053376055e-05,
+      "loss": 15.409666748046876,
+      "step": 700
+    },
+    {
+      "epoch": 0.40785113433596737,
+      "grad_norm": 4.742185592651367,
+      "learning_rate": 1.968106952977309e-05,
+      "loss": 15.346304931640624,
+      "step": 800
+    },
+    {
+      "epoch": 0.45883252612796327,
+      "grad_norm": 4.758422374725342,
+      "learning_rate": 1.9523291817031276e-05,
+      "loss": 15.344024658203125,
+      "step": 900
+    },
+    {
+      "epoch": 0.5098139179199592,
+      "grad_norm": 4.854381084442139,
+      "learning_rate": 1.933482043438185e-05,
+      "loss": 15.307811279296875,
+      "step": 1000
+    },
+    {
+      "epoch": 0.5607953097119551,
+      "grad_norm": 4.7934041023254395,
+      "learning_rate": 1.9116262827077703e-05,
+      "loss": 15.254422607421875,
+      "step": 1100
+    },
+    {
+      "epoch": 0.611776701503951,
+      "grad_norm": 4.670731544494629,
+      "learning_rate": 1.88683234085909e-05,
+      "loss": 15.23345703125,
+      "step": 1200
+    },
+    {
+      "epoch": 0.6627580932959469,
+      "grad_norm": 4.993561267852783,
+      "learning_rate": 1.8591801290280664e-05,
+      "loss": 15.2450927734375,
+      "step": 1300
+    },
+    {
+      "epoch": 0.7137394850879429,
+      "grad_norm": 4.720964431762695,
+      "learning_rate": 1.8287587705849013e-05,
+      "loss": 15.1839599609375,
+      "step": 1400
+    },
+    {
+      "epoch": 0.7647208768799388,
+      "grad_norm": 5.050419330596924,
+      "learning_rate": 1.7956663138885173e-05,
+      "loss": 15.164833984375,
+      "step": 1500
+    },
+    {
+      "epoch": 0.8157022686719347,
+      "grad_norm": 4.826648712158203,
+      "learning_rate": 1.760009416275661e-05,
+      "loss": 15.130496826171875,
+      "step": 1600
+    },
+    {
+      "epoch": 0.8666836604639306,
+      "grad_norm": 4.858438014984131,
+      "learning_rate": 1.721903000303185e-05,
+      "loss": 15.125797119140625,
+      "step": 1700
+    },
+    {
+      "epoch": 0.9176650522559265,
+      "grad_norm": 4.9611430168151855,
+      "learning_rate": 1.6814698833514326e-05,
+      "loss": 15.13617431640625,
+      "step": 1800
+    },
+    {
+      "epoch": 0.9686464440479226,
+      "grad_norm": 4.663859844207764,
+      "learning_rate": 1.63884038178253e-05,
+      "loss": 15.072591552734375,
+      "step": 1900
+    },
+    {
+      "epoch": 1.0,
+      "eval_loss": 3.636704444885254,
+      "eval_runtime": 8.0138,
+      "eval_samples_per_second": 632.91,
+      "eval_steps_per_second": 9.983,
+      "step": 1962
+    },
+    {
+      "epoch": 1.0193729288809585,
+      "grad_norm": 4.863068103790283,
+      "learning_rate": 1.5941518909293737e-05,
+      "loss": 14.968798828125,
+      "step": 2000
+    },
+    {
+      "epoch": 1.0703543206729544,
+      "grad_norm": 5.036495685577393,
+      "learning_rate": 1.5475484422690282e-05,
+      "loss": 15.0290869140625,
+      "step": 2100
+    },
+    {
+      "epoch": 1.1213357124649503,
+      "grad_norm": 5.248174667358398,
+      "learning_rate": 1.4991802392077543e-05,
+      "loss": 15.004036865234376,
+      "step": 2200
+    },
+    {
+      "epoch": 1.1723171042569462,
+      "grad_norm": 4.950564384460449,
+      "learning_rate": 1.4492031729738489e-05,
+      "loss": 15.002611083984375,
+      "step": 2300
+    },
+    {
+      "epoch": 1.2232984960489421,
+      "grad_norm": 4.509192943572998,
+      "learning_rate": 1.3977783201785732e-05,
+      "loss": 14.96060302734375,
+      "step": 2400
+    },
+    {
+      "epoch": 1.274279887840938,
+      "grad_norm": 4.900182723999023,
+      "learning_rate": 1.3450714236645352e-05,
+      "loss": 14.971297607421874,
+      "step": 2500
+    },
+    {
+      "epoch": 1.325261279632934,
+      "grad_norm": 5.138764381408691,
+      "learning_rate": 1.2912523583147625e-05,
+      "loss": 14.928385009765625,
+      "step": 2600
+    },
+    {
+      "epoch": 1.3762426714249298,
+      "grad_norm": 4.894199848175049,
+      "learning_rate": 1.2364945835441636e-05,
+      "loss": 14.938167724609375,
+      "step": 2700
+    },
+    {
+      "epoch": 1.4272240632169257,
+      "grad_norm": 4.8737921714782715,
+      "learning_rate": 1.1809745842380042e-05,
+      "loss": 14.923902587890625,
+      "step": 2800
+    },
+    {
+      "epoch": 1.4782054550089216,
+      "grad_norm": 4.8258819580078125,
+      "learning_rate": 1.1248713019392635e-05,
+      "loss": 14.89677001953125,
+      "step": 2900
+    },
+    {
+      "epoch": 1.5291868468009175,
+      "grad_norm": 4.769787788391113,
+      "learning_rate": 1.0683655581181524e-05,
+      "loss": 14.87692626953125,
+      "step": 3000
+    },
+    {
+      "epoch": 1.5801682385929134,
+      "grad_norm": 4.92316198348999,
+      "learning_rate": 1.0116394713826117e-05,
+      "loss": 14.849693603515625,
+      "step": 3100
+    },
+    {
+      "epoch": 1.6311496303849093,
+      "grad_norm": 4.873258590698242,
+      "learning_rate": 9.548758705081177e-06,
+      "loss": 14.833634033203126,
+      "step": 3200
+    },
+    {
+      "epoch": 1.6821310221769055,
+      "grad_norm": 4.738825798034668,
+      "learning_rate": 8.98257705178612e-06,
+      "loss": 14.85665283203125,
+      "step": 3300
+    },
+    {
+      "epoch": 1.7331124139689014,
+      "grad_norm": 4.907736778259277,
+      "learning_rate": 8.419674563377416e-06,
+      "loss": 14.8664599609375,
+      "step": 3400
+    },
+    {
+      "epoch": 1.7840938057608973,
+      "grad_norm": 4.977413177490234,
+      "learning_rate": 7.861865480508541e-06,
+      "loss": 14.83008056640625,
+      "step": 3500
+    },
+    {
+      "epoch": 1.8350751975528932,
+      "grad_norm": 4.792273044586182,
+      "learning_rate": 7.310947627733231e-06,
+      "loss": 14.81404541015625,
+      "step": 3600
+    },
+    {
+      "epoch": 1.886056589344889,
+      "grad_norm": 4.84648323059082,
+      "learning_rate": 6.768696619097996e-06,
+      "loss": 14.831793212890625,
+      "step": 3700
+    },
+    {
+      "epoch": 1.9370379811368852,
+      "grad_norm": 4.854404449462891,
+      "learning_rate": 6.236860135319321e-06,
+      "loss": 14.826976318359375,
+      "step": 3800
+    },
+    {
+      "epoch": 1.988019372928881,
+      "grad_norm": 4.615888595581055,
+      "learning_rate": 5.717152290990302e-06,
+      "loss": 14.767562255859374,
+      "step": 3900
+    },
+    {
+      "epoch": 2.0,
+      "eval_loss": 3.56946063041687,
+      "eval_runtime": 8.0481,
+      "eval_samples_per_second": 630.208,
+      "eval_steps_per_second": 9.94,
+      "step": 3924
+    },
+    {
+      "epoch": 2.038745857761917,
+      "grad_norm": 5.015805721282959,
+      "learning_rate": 5.211248109971254e-06,
+      "loss": 14.695634765625,
+      "step": 4000
+    },
+    {
+      "epoch": 2.089727249553913,
+      "grad_norm": 4.800245761871338,
+      "learning_rate": 4.720778126770141e-06,
+      "loss": 14.764068603515625,
+      "step": 4100
+    },
+    {
+      "epoch": 2.140708641345909,
+      "grad_norm": 4.756154537200928,
+      "learning_rate": 4.247323131312676e-06,
+      "loss": 14.755054931640625,
+      "step": 4200
+    },
+    {
+      "epoch": 2.191690033137905,
+      "grad_norm": 4.989803314208984,
+      "learning_rate": 3.7924090740397178e-06,
+      "loss": 14.760721435546875,
+      "step": 4300
+    },
+    {
+      "epoch": 2.2426714249299007,
+      "grad_norm": 4.568801403045654,
+      "learning_rate": 3.3575021477529313e-06,
+      "loss": 14.72455810546875,
+      "step": 4400
+    },
+    {
+      "epoch": 2.2936528167218966,
+      "grad_norm": 4.871072769165039,
+      "learning_rate": 2.944004062059924e-06,
+      "loss": 14.743800048828126,
+      "step": 4500
+    },
+    {
+      "epoch": 2.3446342085138925,
+      "grad_norm": 4.790256500244141,
+      "learning_rate": 2.5532475256494073e-06,
+      "loss": 14.7241162109375,
+      "step": 4600
+    },
+    {
+      "epoch": 2.3956156003058884,
+      "grad_norm": 4.770144462585449,
+      "learning_rate": 2.186491950957048e-06,
+      "loss": 14.711162109375,
+      "step": 4700
+    },
+    {
+      "epoch": 2.4465969920978843,
+      "grad_norm": 4.44427490234375,
+      "learning_rate": 1.8449193950659018e-06,
+      "loss": 14.72890625,
+      "step": 4800
+    },
+    {
+      "epoch": 2.49757838388988,
+      "grad_norm": 4.664465427398682,
+      "learning_rate": 1.5296307499239903e-06,
+      "loss": 14.713804931640626,
+      "step": 4900
+    },
+    {
+      "epoch": 2.548559775681876,
+      "grad_norm": 4.861291408538818,
+      "learning_rate": 1.2416421941579448e-06,
+      "loss": 14.730694580078126,
+      "step": 5000
+    },
+    {
+      "epoch": 2.599541167473872,
+      "grad_norm": 4.662012577056885,
+      "learning_rate": 9.818819179185713e-07,
+      "loss": 14.70477294921875,
+      "step": 5100
+    },
+    {
+      "epoch": 2.650522559265868,
+      "grad_norm": 4.803001403808594,
+      "learning_rate": 7.511871313142238e-07,
+      "loss": 14.7314208984375,
+      "step": 5200
+    },
+    {
+      "epoch": 2.701503951057864,
+      "grad_norm": 4.746646404266357,
+      "learning_rate": 5.503013660737899e-07,
+      "loss": 14.70580810546875,
+      "step": 5300
+    },
+    {
+      "epoch": 2.7524853428498597,
+      "grad_norm": 4.867108345031738,
+      "learning_rate": 3.798720791360988e-07,
+      "loss": 14.710306396484375,
+      "step": 5400
+    },
+    {
+      "epoch": 2.8034667346418556,
+      "grad_norm": 4.6949992179870605,
+      "learning_rate": 2.404485658893807e-07,
+      "loss": 14.725491943359375,
+      "step": 5500
+    },
+    {
+      "epoch": 2.8544481264338515,
+      "grad_norm": 4.641607284545898,
+      "learning_rate": 1.3248018978643695e-07,
+      "loss": 14.7078369140625,
+      "step": 5600
+    },
+    {
+      "epoch": 2.905429518225848,
+      "grad_norm": 4.756202220916748,
+      "learning_rate": 5.6314934041501455e-08,
+      "loss": 14.697396240234376,
+      "step": 5700
+    },
+    {
+      "epoch": 2.9564109100178433,
+      "grad_norm": 4.691574573516846,
+      "learning_rate": 1.2198280076668455e-08,
+      "loss": 14.694278564453125,
+      "step": 5800
+    },
+    {
+      "epoch": 3.0,
+      "eval_loss": 3.5591108798980713,
+      "eval_runtime": 8.0338,
+      "eval_samples_per_second": 631.333,
+      "eval_steps_per_second": 9.958,
+      "step": 5886
+    }
+  ],
+  "logging_steps": 100,
+  "max_steps": 5886,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 3,
+  "save_steps": 1962,
+  "stateful_callbacks": {
+    "EarlyStoppingCallback": {
+      "args": {
+        "early_stopping_patience": 3,
+        "early_stopping_threshold": 0.0
+      },
+      "attributes": {
+        "early_stopping_patience_counter": 0
+      }
+    },
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": true
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 3.964983111028869e+17,
+  "train_batch_size": 64,
+  "trial_name": null,
+  "trial_params": null
+}
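As a quick sanity check on the state above: for a masked-language-modeling objective, perplexity is simply the exponential of the cross-entropy loss, so the best `eval_loss` of 3.5591 recorded in `checkpoint-5886/trainer_state.json` corresponds to a perplexity of roughly 35. A minimal sketch (the variable names are illustrative, not part of the repo):

```python
import math

# Best eval loss recorded in checkpoint-5886/trainer_state.json (step 5886).
best_eval_loss = 3.5591108798980713

# Perplexity for an MLM objective is exp(cross-entropy loss).
perplexity = math.exp(best_eval_loss)
print(f"perplexity = {perplexity:.2f}")  # roughly 35.13
```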
checkpoint-5886/training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
+size 5137
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:170894fbff2599922589dc645dfc871455543fe1f1fa33d3381f8353cf0b2a5b
 size 442633860
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:accc825ca2e280888c9eed825fcb7985700c1fb466ed8b16208ff9e7b14f1318
 size 5137