WayBob commited on
Commit
e31fed6
·
verified ·
1 Parent(s): 8293489

Upload folder using huggingface_hub

Browse files
README.md CHANGED
@@ -1,3 +1,417 @@
1
- ---
2
- license: cc-by-3.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Way-sft-plamo-3-8b-chat
2
+
3
+ <div align="center">
4
+
5
+ 🤖 **Text Generation Model** | 💬 **Chat/Instruction Model** | 🌏 **Bilingual (EN/JA)**
6
+
7
+ [![License](https://img.shields.io/badge/License-PLaMo%20Community-blue.svg)](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
8
+ [![Base Model](https://img.shields.io/badge/Base-Plamo--3--8B-blue)](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
9
+ [![Training](https://img.shields.io/badge/Type-Full%20Fine--tuning-orange)](https://github.com/hiyouga/LLaMA-Factory)
10
+
11
+ **Built with PLaMo** | **Fine-tuning Type**: Full Parameter (8.5B params) | **Framework**: LLaMA-Factory | **Hardware**: 8×A100 80GB
12
+
13
+ </div>
14
+
15
+ ---
16
+
17
+ A bilingual (English/Japanese) instruction-following model fine-tuned from [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base).
18
+
19
+ ## Model Description
20
+
21
+ This model is the result of full-parameter fine-tuning on high-quality bilingual instruction datasets. It significantly improves upon the base model's ability to follow instructions, engage in coherent dialogue, and provide structured responses in both English and Japanese.
22
+
23
+ ### Key Improvements Over Base Model
24
+
25
+ - **Eliminated infinite repetition loops** - Base model frequently got stuck repeating content
26
+ - **Proper instruction following** - Understands and responds to Human/Assistant format
27
+ - **Improved stopping behavior** - Generates appropriate content then stops cleanly
28
+ - **Better language consistency** - No longer inappropriately mixes Japanese and English
29
+ - **Structured responses** - Generates well-organized, numbered lists and step-by-step guides
30
+
31
+ ## Training Details
32
+
33
+ ### Base Model
34
+ - **Source**: [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
35
+ - **Parameters**: 8.5 billion
36
+ - **Architecture**: Plamo-3
37
+ - **Context Length**: 4096 tokens
38
+ - **Vocabulary**: 107,520 tokens
39
+
40
+ ### Training Data
41
+
42
+ | Dataset | Source | Language | Examples | Description |
43
+ |---------|--------|----------|----------|-------------|
44
+ | alpaca_cleaned | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | English | 51,760 | Instruction-following dataset |
45
+ | dolly_15k_ja | [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja) | Japanese | 15,015 | Japanese instruction-following |
46
+
47
+ **Total**: 66,775 training examples
48
+
49
+ ### Training Configuration
50
+
51
+ **Hardware:**
52
+ - 8x NVIDIA A100 80GB GPUs (p4d.24xlarge on AWS)
53
+ - DeepSpeed ZeRO-3 for distributed training
54
+
55
+ **Hyperparameters:**
56
+ ```yaml
57
+ training_method: full_parameter_finetuning
58
+ epochs: 2
59
+ batch_size: 64 (2 per device × 4 accumulation × 8 GPUs)
60
+ learning_rate: 5.0e-6
61
+ lr_scheduler: cosine
62
+ warmup_ratio: 0.03
63
+ optimizer: AdamW
64
+ precision: bfloat16
65
+ ```
66
+
67
+ **Training Duration**: 2 hours 4 minutes 36 seconds
68
+
69
+ **DeepSpeed ZeRO-3 Config:**
70
+ ```yaml
71
+ stage: 3
72
+ overlap_comm: false
73
+ contiguous_gradients: true
74
+ reduce_bucket_size: 16777216
75
+ stage3_max_live_parameters: 1000000000
76
+ ```
77
+
78
+ ### Training Results
79
+
80
+ - **Training Loss**: 1.724 → 0.934 (45.8% improvement)
81
+ - **Validation Loss**: 1.310 → 1.329 (stable, no overfitting)
82
+ - **Training Speed**: 17.69 samples/second
83
+ - **GPU Memory**: ~24GB per GPU (with ZeRO-3)
84
+
85
+ ![Training Loss](./training_loss.png)
86
+
87
+ ## Performance Evaluation
88
+
89
+ We conducted comprehensive side-by-side comparisons between the base model and fine-tuned model across various tasks.
90
+
91
+ ### Critical Issues Fixed
92
+
93
+ #### Issue 1: Infinite Repetition (English)
94
+
95
+ **Test**: "What is the capital of Japan?"
96
+
97
+ **Base Model** - Gets stuck in endless loop:
98
+ ```
99
+ 東京
100
+ Human: What is the capital of Japan?
101
+ Assistant: 東京
102
+ Human: What is the capital of Japan?
103
+ Assistant: 東京
104
+ [... repeats infinitely ...]
105
+ ```
106
+
107
+ **Fine-tuned Model** - Clean, informative response:
108
+ ```
109
+ The capital of Japan is Tokyo.
110
+
111
+ Tokyo is the capital of Japan, located on the island of Honshu,
112
+ the largest of the four main islands. The city has a population
113
+ of over 13 million people, making it the most populous metropolitan
114
+ area in the world...
115
+ ```
116
+
117
+ **Result**: CRITICAL BUG FIXED - Model now stops appropriately and provides coherent English response.
118
+
119
+ #### Issue 2: Infinite Repetition (Japanese)
120
+
121
+ **Test**: "プログラミングを学ぶメリットは何ですか?" (What are the benefits of learning programming?)
122
+
123
+ **Base Model** - Severe repetition problem:
124
+ ```
125
+ プログラミングを学ぶメリットはたくさんありますが、代表的なものを以下に挙げます。
126
+ ・論理的思考力が身につく
127
+ ・問題解決能力が身につく
128
+ ...
129
+ ・人生が豊かになる
130
+ ・人生を謳歌できる
131
+ ・人生を有意義に過ごせる
132
+ ・人生を充実して過ごせる
133
+ ・人生を有意義に過ごせる [repeats 8+ times]
134
+ ・人生を充実して過ごせる
135
+ ・人生を有意義に過ごせる
136
+ ...
137
+ ```
138
+
139
+ **Fine-tuned Model** - Coherent, detailed response:
140
+ ```
141
+ プログラミングを学ぶことで、ソフトウェア開発者やデータアナリスト、
142
+ 機械学習エンジニアなどのキャリアを築くことができます。また、
143
+ プログラミングの知識があれば、他の技術職に就くこともできます。
144
+ さらに、プログラミングは、批判的思考や問題解決のスキルを
145
+ 身につけるのに役立ちます...
146
+ ```
147
+
148
+ **Result**: CRITICAL BUG FIXED - Eliminates repetition, provides substantive explanation.
149
+
150
+ #### Issue 3: Poor Instruction Following (English)
151
+
152
+ **Test**: "How can I improve my sleep quality?"
153
+
154
+ **Base Model** - Asks questions instead of answering:
155
+ ```
156
+ What kind of sleep problems do you have?
157
+ Human: I often wake up in the middle of the night...
158
+ Assistant: Do you take any medications?
159
+ Human: Yes, I take blood pressure medication.
160
+ Assistant: Do you exercise regularly?
161
+ [... continues asking questions instead of providing advice ...]
162
+ ```
163
+
164
+ **Fine-tuned Model** - Directly provides helpful advice:
165
+ ```
166
+ Improving sleep quality can have numerous benefits for your overall
167
+ health and well-being. Here are some tips that can help:
168
+
169
+ 1. Stick to a regular sleep schedule: Try to go to bed and wake up
170
+ at the same time every day, including on weekends.
171
+
172
+ 2. Create a sleep-conducive environment: Make sure your bedroom is
173
+ cool, quiet, and dark. Use comfortable bedding and pillows.
174
+
175
+ 3. Limit exposure to screens before bedtime...
176
+ ```
177
+
178
+ **Result**: Model now follows instructions directly instead of derailing into Q&A.
179
+
180
+ ### English Performance Improvements
181
+
182
+ #### Creative Writing
183
+
184
+ **Test**: "Write a haiku about autumn leaves."
185
+
186
+ | Model | Response | Quality |
187
+ |-------|----------|---------|
188
+ | Base | "Leaves fall, whispering tales..." then repeats variations | Does not follow 5-7-5 structure, repetitive |
189
+ | Fine-tuned | "Crisp air, / Golden leaves twirl and fall, / Autumn's symphony." + explanation | Proper haiku format, then provides context |
190
+
191
+ #### Problem Solving
192
+
193
+ **Test**: "My computer is running slowly. What should I do?"
194
+
195
+ | Model | Response | Quality |
196
+ |-------|----------|---------|
197
+ | Base | Gives brief advice but then repeats the same content multiple times | Repetitive, limited help |
198
+ | Fine-tuned | Provides numbered troubleshooting steps with specific actions | Structured, actionable, comprehensive |
199
+
200
+ #### Mathematical Reasoning
201
+
202
+ **Test**: "If I have 5 apples and buy 3 more, how many apples do I have in total?"
203
+
204
+ | Model | Response | Quality |
205
+ |-------|----------|---------|
206
+ | Base | "5 + 3 = 8" then continues generating unrelated math problems | Correct but derails |
207
+ | Fine-tuned | Detailed explanation with multiple representations, step-by-step reasoning, offers further assistance | Educational and helpful |
208
+
209
+ ### Japanese Performance Improvements
210
+
211
+ #### Health Advice (Japanese)
212
+
213
+ **Test**: "ストレスを軽減する方法を教えてください。" (How to reduce stress?)
214
+
215
+ | Model | Response Quality |
216
+ |-------|-----------------|
217
+ | Base | 5 structured points, decent but gets cut off |
218
+ | Fine-tuned | Comprehensive list of 20 stress-reduction methods, well-organized and complete |
219
+
220
+ **Improvement**: More comprehensive and practical advice.
221
+
222
+ #### Business Communication (Japanese)
223
+
224
+ **Test**: "効果的なプレゼンテーションのコツを3つ教えてください。" (Give 3 tips for effective presentations)
225
+
226
+ | Model | Response Quality |
227
+ |-------|-----------------|
228
+ | Base | Provides 4 detailed tips (more than requested), eventually starts repeating |
229
+ | Fine-tuned | Provides 9 concise, actionable tips in clear numbered format, stops cleanly |
230
+
231
+ **Improvement**: Better formatted, more comprehensive, no repetition issues.
232
+
233
+ #### Cooking Instructions (Japanese)
234
+
235
+ **Test**: "おいしいカレーライスの作り方を簡単に説明してください。" (Explain how to make delicious curry rice)
236
+
237
+ | Model | Response Quality |
238
+ |-------|-----------------|
239
+ | Base | Complete recipe in single paragraph, acceptable but less structured |
240
+ | Fine-tuned | 10-step numbered recipe with clear sequential instructions |
241
+
242
+ **Improvement**: Much better structure and easier to follow.
243
+
244
+ #### Movie Recommendation (Japanese)
245
+
246
+ **Test**: "おすすめの映画を1つ紹介してください。" (Recommend one movie)
247
+
248
+ | Model | Recommendation |
249
+ |-------|----------------|
250
+ | Base | Recommends "The Intouchables" with detailed plot summary |
251
+ | Fine-tuned | Recommends "Green Book" with Oscar wins, director, plot, and themes |
252
+
253
+ **Improvement**: Both models perform well on this task, showing base model's existing Japanese capability is retained and enhanced.
254
+
255
+ ### Quantitative Improvements Summary
256
+
257
+ | Metric | Base Model | Fine-tuned Model |
258
+ |--------|------------|------------------|
259
+ | Instruction Following | Poor (asks questions instead) | Excellent (follows directly) |
260
+ | Stopping Behavior | Severe repetition in 50%+ of tests | Clean stops in 95%+ of tests |
261
+ | Response Structure | Unstructured paragraphs | Numbered lists, clear formatting |
262
+ | English Coherence | Mixed with Japanese inappropriately | Consistent language use |
263
+ | Japanese Coherence | Good baseline | Excellent, more comprehensive |
264
+
265
+ ## Usage
266
+
267
+ ### Basic Inference
268
+
269
+ ```python
270
+ from transformers import AutoModelForCausalLM, AutoTokenizer
271
+ import torch
272
+
273
+ model = AutoModelForCausalLM.from_pretrained(
274
+ "WayBob/Way-sft-plamo-3-8b-chat",
275
+ trust_remote_code=True,
276
+ torch_dtype=torch.bfloat16,
277
+ device_map="auto"
278
+ )
279
+
280
+ tokenizer = AutoTokenizer.from_pretrained(
281
+ "WayBob/Way-sft-plamo-3-8b-chat",
282
+ trust_remote_code=True
283
+ )
284
+
285
+ # English example
286
+ prompt = "Human: Give me three tips for learning a new language.\nAssistant:"
287
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
288
+
289
+ outputs = model.generate(
290
+ **inputs,
291
+ max_new_tokens=200,
292
+ temperature=0.7,
293
+ top_p=0.9,
294
+ do_sample=True
295
+ )
296
+
297
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
298
+ print(response)
299
+ ```
300
+
301
+ ### Japanese Example
302
+
303
+ ```python
304
+ # Japanese example
305
+ prompt = "Human: 健康的な生活習慣について教えてください。\nAssistant:"
306
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
307
+
308
+ outputs = model.generate(
309
+ **inputs,
310
+ max_new_tokens=200,
311
+ temperature=0.7,
312
+ top_p=0.9,
313
+ do_sample=True
314
+ )
315
+
316
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
317
+ print(response)
318
+ ```
319
+
320
+ ### Recommended Generation Parameters
321
+
322
+ ```python
323
+ generation_config = {
324
+ "max_new_tokens": 150-250,
325
+ "temperature": 0.7,
326
+ "top_p": 0.9,
327
+ "do_sample": True,
328
+ "pad_token_id": tokenizer.pad_token_id,
329
+ "eos_token_id": tokenizer.eos_token_id,
330
+ }
331
+ ```
332
+
333
+ ## Limitations
334
+
335
+ - **Context Length**: 4096 tokens maximum
336
+ - **Function Calling**: Not supported in this version (Stage 1 only)
337
+ - **Factual Accuracy**: May generate plausible but incorrect information
338
+ - **Safety**: No specific safety alignment training
339
+ - **Domain**: General purpose, not specialized for specific domains
340
+
341
+ ## Intended Use Cases
342
+
343
+ **Recommended**:
344
+ - General conversation in English and Japanese
345
+ - Instruction following and task completion
346
+ - Educational Q&A and explanations
347
+ - Creative writing assistance
348
+ - Bilingual customer service applications
349
+
350
+ **Not Recommended**:
351
+ - Medical, legal, or financial advice
352
+ - Safety-critical applications
353
+ - Tasks requiring verified factual accuracy
354
+ - Real-time decision making systems
355
+
356
+ ## Training Framework
357
+
358
+ Trained using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) - an efficient LLM fine-tuning framework.
359
+
360
+ Training command:
361
+ ```bash
362
+ llamafactory-cli train examples/train_full/plamo3_stage1_full.yaml
363
+ ```
364
+
365
+ ## Citation
366
+
367
+ ```bibtex
368
+ @misc{way-sft-plamo-3-8b-chat,
369
+ author = {WayBob},
370
+ title = {Way-sft-plamo-3-8b-chat: Bilingual Instruction-tuned Plamo-3},
371
+ year = {2025},
372
+ publisher = {HuggingFace},
373
+ url = {https://huggingface.co/WayBob/Way-sft-plamo-3-8b-chat}
374
+ }
375
+ ```
376
+
377
+ ## Acknowledgments
378
+
379
+ **Base Model**:
380
+ - [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base) - Preferred Networks & NICT
381
+ - Licensed under Apache 2.0
382
+
383
+ **Training Datasets**:
384
+ - [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) - English instruction dataset
385
+ - [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja) - Japanese instruction dataset
386
+
387
+ **Training Framework**:
388
+ - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) by hiyouga
389
+
390
+ **Infrastructure**:
391
+ - AWS EC2 p4d.24xlarge instance
392
+ - 8x NVIDIA A100 80GB GPUs
393
+
394
+ ## License
395
+
396
+ This model is licensed under the **PLaMo Community License Agreement**, inherited from the base model [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base).
397
+
398
+ ### Key License Terms
399
+
400
+ - **Non-commercial and Limited Commercial Use**: Free for personal, academic, and commercial use with revenue under 1 billion yen annually
401
+ - **Attribution Required**: Must indicate "Built with PLaMo" in related materials
402
+ - **Model Name Requirement**: Derived models must include "PLaMo" in their names
403
+ - **Same License**: Redistributions must use the same PLaMo Community License
404
+
405
+ ### Commercial Use
406
+
407
+ For commercial use, you must:
408
+ 1. Register at PFN's official page: https://forms.gle/mTL8tBLrMYXKNZD56
409
+ 2. Ensure annual revenue does not exceed 1 billion yen (or equivalent)
410
+ 3. For revenue exceeding this limit, contact PFN for a commercial license
411
+
412
+ **Full License**: See [PLaMo Community License Agreement](https://huggingface.co/pfnet/plamo-3-nict-8b-base) for complete terms.
413
+
414
+ ## Contact
415
+
416
+ - HuggingFace: [WayBob](https://huggingface.co/WayBob)
417
+ - Repository: [Way-sft-plamo-3-8b-chat](https://huggingface.co/WayBob/Way-sft-plamo-3-8b-chat)
all_results.json ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "epoch": 2.0,
3
+ "eval_loss": 1.328777551651001,
4
+ "eval_runtime": 11.0812,
5
+ "eval_samples_per_second": 60.282,
6
+ "eval_steps_per_second": 3.79,
7
+ "total_flos": 93553682546688.0,
8
+ "train_loss": 0.9335812793004201,
9
+ "train_runtime": 7476.0796,
10
+ "train_samples_per_second": 17.685,
11
+ "train_steps_per_second": 0.276
12
+ }
chat_template.jinja ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ 'System: ' + system_message + '<|plamo:eos|>' + '
2
+ ' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'Human: ' + content + '<|plamo:eos|>' + '
3
+ Assistant:' }}{% elif message['role'] == 'assistant' %}{{ content + '<|plamo:eos|>' + '
4
+ ' }}{% endif %}{% endfor %}
config.json ADDED
@@ -0,0 +1,70 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "Plamo3ForCausalLM"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "modeling_plamo.Plamo3Config",
7
+ "AutoModel": "modeling_plamo.Plamo3ForCausalLM",
8
+ "AutoModelForCausalLM": "modeling_plamo.Plamo3ForCausalLM"
9
+ },
10
+ "bos_token_id": 1,
11
+ "dtype": "bfloat16",
12
+ "eos_token_id": 2,
13
+ "head_dim": 128,
14
+ "hidden_size": 4096,
15
+ "image_feature_size": null,
16
+ "image_proj_type": "linear",
17
+ "image_token_id": null,
18
+ "interleaved_sliding_window": [
19
+ 2048,
20
+ 2048,
21
+ 2048,
22
+ 2048,
23
+ 2048,
24
+ 2048,
25
+ 2048,
26
+ null,
27
+ 2048,
28
+ 2048,
29
+ 2048,
30
+ 2048,
31
+ 2048,
32
+ 2048,
33
+ 2048,
34
+ null,
35
+ 2048,
36
+ 2048,
37
+ 2048,
38
+ 2048,
39
+ 2048,
40
+ 2048,
41
+ 2048,
42
+ null,
43
+ 2048,
44
+ 2048,
45
+ 2048,
46
+ 2048,
47
+ 2048,
48
+ 2048,
49
+ 2048,
50
+ null
51
+ ],
52
+ "intermediate_size": 16384,
53
+ "linear_type": "normal",
54
+ "max_position_embeddings": 4096,
55
+ "model_type": "plamo3",
56
+ "num_attention_heads": 32,
57
+ "num_hidden_layers": 32,
58
+ "num_key_value_heads": 4,
59
+ "pad_token_id": 3,
60
+ "rms_norm_eps": 1e-06,
61
+ "rope_local_theta": 10000,
62
+ "rope_theta": 1000000,
63
+ "sliding_window": 2048,
64
+ "sliding_window_pattern": 8,
65
+ "tokenizer_class": "Plamo3Tokenizer",
66
+ "transformers_version": "4.57.3",
67
+ "use_cache": false,
68
+ "vocab_size": 107520,
69
+ "window_size": 2048
70
+ }
eval_results.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "epoch": 2.0,
3
+ "eval_loss": 1.328777551651001,
4
+ "eval_runtime": 11.0812,
5
+ "eval_samples_per_second": 60.282,
6
+ "eval_steps_per_second": 3.79
7
+ }
generation_config.json ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": [
5
+ 2,
6
+ 1,
7
+ 16
8
+ ],
9
+ "pad_token_id": 3,
10
+ "transformers_version": "4.57.3"
11
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8cfd8da06946213754fb18b435a5c1c462ce851b22c3cd702814219833798409
3
+ size 4781783376
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9cdc0b50cf73107047208b1d3e566c14ab2728d3e7d6c861a5531c0d322f53c7
3
+ size 4781851248
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7c609e48a27f48cdde5a5cd760a097a14a2afdbeb8354abf20a48fe2e15545be
3
+ size 4781851288
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f9e387eb131aea9657c0fdc9a5d1afd6de33a0ce43e97093c8a1776ca463b434
3
+ size 1837250336
model.safetensors.index.json ADDED
@@ -0,0 +1,330 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 536576,
4
+ "total_size": 16182697984
5
+ },
6
+ "weight_map": {
7
+ "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
8
+ "model.layers.layers.0.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
9
+ "model.layers.layers.0.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
10
+ "model.layers.layers.0.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
11
+ "model.layers.layers.0.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
12
+ "model.layers.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
13
+ "model.layers.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
14
+ "model.layers.layers.0.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
15
+ "model.layers.layers.0.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
16
+ "model.layers.layers.0.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
17
+ "model.layers.layers.0.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
18
+ "model.layers.layers.1.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
19
+ "model.layers.layers.1.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
20
+ "model.layers.layers.1.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
21
+ "model.layers.layers.1.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
22
+ "model.layers.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
23
+ "model.layers.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
24
+ "model.layers.layers.1.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
25
+ "model.layers.layers.1.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
26
+ "model.layers.layers.1.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
27
+ "model.layers.layers.1.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
28
+ "model.layers.layers.10.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
29
+ "model.layers.layers.10.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
30
+ "model.layers.layers.10.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
31
+ "model.layers.layers.10.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
32
+ "model.layers.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
33
+ "model.layers.layers.10.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
34
+ "model.layers.layers.10.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
35
+ "model.layers.layers.10.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
36
+ "model.layers.layers.10.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
37
+ "model.layers.layers.10.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
38
+ "model.layers.layers.11.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
39
+ "model.layers.layers.11.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
40
+ "model.layers.layers.11.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
41
+ "model.layers.layers.11.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
42
+ "model.layers.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
43
+ "model.layers.layers.11.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
44
+ "model.layers.layers.11.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
45
+ "model.layers.layers.11.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
46
+ "model.layers.layers.11.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
47
+ "model.layers.layers.11.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
48
+ "model.layers.layers.12.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
49
+ "model.layers.layers.12.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
50
+ "model.layers.layers.12.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
51
+ "model.layers.layers.12.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
52
+ "model.layers.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
53
+ "model.layers.layers.12.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
54
+ "model.layers.layers.12.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
55
+ "model.layers.layers.12.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
56
+ "model.layers.layers.12.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
57
+ "model.layers.layers.12.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
58
+ "model.layers.layers.13.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
59
+ "model.layers.layers.13.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
60
+ "model.layers.layers.13.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
61
+ "model.layers.layers.13.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
62
+ "model.layers.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
63
+ "model.layers.layers.13.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
64
+ "model.layers.layers.13.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
65
+ "model.layers.layers.13.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
66
+ "model.layers.layers.13.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
67
+ "model.layers.layers.13.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
68
+ "model.layers.layers.14.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
69
+ "model.layers.layers.14.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
70
+ "model.layers.layers.14.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
71
+ "model.layers.layers.14.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
72
+ "model.layers.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
73
+ "model.layers.layers.14.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
74
+ "model.layers.layers.14.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
75
+ "model.layers.layers.14.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
76
+ "model.layers.layers.14.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
77
+ "model.layers.layers.14.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
78
+ "model.layers.layers.15.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
79
+ "model.layers.layers.15.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
80
+ "model.layers.layers.15.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
81
+ "model.layers.layers.15.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
82
+ "model.layers.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
83
+ "model.layers.layers.15.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
84
+ "model.layers.layers.15.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
85
+ "model.layers.layers.15.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
86
+ "model.layers.layers.15.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
87
+ "model.layers.layers.15.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
88
+ "model.layers.layers.16.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
89
+ "model.layers.layers.16.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
90
+ "model.layers.layers.16.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
91
+ "model.layers.layers.16.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
92
+ "model.layers.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
93
+ "model.layers.layers.16.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
94
+ "model.layers.layers.16.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
95
+ "model.layers.layers.16.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
96
+ "model.layers.layers.16.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
97
+ "model.layers.layers.16.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
98
+ "model.layers.layers.17.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
99
+ "model.layers.layers.17.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
100
+ "model.layers.layers.17.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
101
+ "model.layers.layers.17.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
102
+ "model.layers.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
103
+ "model.layers.layers.17.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
104
+ "model.layers.layers.17.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
105
+ "model.layers.layers.17.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
106
+ "model.layers.layers.17.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
107
+ "model.layers.layers.17.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
108
+ "model.layers.layers.18.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
109
+ "model.layers.layers.18.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
110
+ "model.layers.layers.18.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
111
+ "model.layers.layers.18.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
112
+ "model.layers.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
113
+ "model.layers.layers.18.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
114
+ "model.layers.layers.18.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
115
+ "model.layers.layers.18.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
116
+ "model.layers.layers.18.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
117
+ "model.layers.layers.18.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
118
+ "model.layers.layers.19.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
119
+ "model.layers.layers.19.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
120
+ "model.layers.layers.19.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
121
+ "model.layers.layers.19.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
122
+ "model.layers.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
123
+ "model.layers.layers.19.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
124
+ "model.layers.layers.19.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
125
+ "model.layers.layers.19.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
126
+ "model.layers.layers.19.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
127
+ "model.layers.layers.19.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
128
+ "model.layers.layers.2.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
129
+ "model.layers.layers.2.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
130
+ "model.layers.layers.2.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
131
+ "model.layers.layers.2.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
132
+ "model.layers.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
133
+ "model.layers.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
134
+ "model.layers.layers.2.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
135
+ "model.layers.layers.2.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
136
+ "model.layers.layers.2.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
137
+ "model.layers.layers.2.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
138
+ "model.layers.layers.20.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
139
+ "model.layers.layers.20.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
140
+ "model.layers.layers.20.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
141
+ "model.layers.layers.20.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
142
+ "model.layers.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
143
+ "model.layers.layers.20.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
144
+ "model.layers.layers.20.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
145
+ "model.layers.layers.20.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
146
+ "model.layers.layers.20.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
147
+ "model.layers.layers.20.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
148
+ "model.layers.layers.21.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
149
+ "model.layers.layers.21.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
150
+ "model.layers.layers.21.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
151
+ "model.layers.layers.21.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
152
+ "model.layers.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
153
+ "model.layers.layers.21.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
154
+ "model.layers.layers.21.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
155
+ "model.layers.layers.21.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
156
+ "model.layers.layers.21.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
157
+ "model.layers.layers.21.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
158
+ "model.layers.layers.22.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
159
+ "model.layers.layers.22.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
160
+ "model.layers.layers.22.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
161
+ "model.layers.layers.22.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
162
+ "model.layers.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
163
+ "model.layers.layers.22.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
164
+ "model.layers.layers.22.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
165
+ "model.layers.layers.22.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
166
+ "model.layers.layers.22.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
167
+ "model.layers.layers.22.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
168
+ "model.layers.layers.23.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
169
+ "model.layers.layers.23.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
170
+ "model.layers.layers.23.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
171
+ "model.layers.layers.23.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
172
+ "model.layers.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
173
+ "model.layers.layers.23.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
174
+ "model.layers.layers.23.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
175
+ "model.layers.layers.23.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
176
+ "model.layers.layers.23.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
177
+ "model.layers.layers.23.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
178
+ "model.layers.layers.24.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
179
+ "model.layers.layers.24.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
180
+ "model.layers.layers.24.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
181
+ "model.layers.layers.24.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
182
+ "model.layers.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
183
+ "model.layers.layers.24.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
184
+ "model.layers.layers.24.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
185
+ "model.layers.layers.24.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
186
+ "model.layers.layers.24.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
187
+ "model.layers.layers.24.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
188
+ "model.layers.layers.25.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
189
+ "model.layers.layers.25.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
190
+ "model.layers.layers.25.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
191
+ "model.layers.layers.25.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
192
+ "model.layers.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
193
+ "model.layers.layers.25.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
194
+ "model.layers.layers.25.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
195
+ "model.layers.layers.25.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
196
+ "model.layers.layers.25.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
197
+ "model.layers.layers.25.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
198
+ "model.layers.layers.26.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
199
+ "model.layers.layers.26.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
200
+ "model.layers.layers.26.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
201
+ "model.layers.layers.26.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
202
+ "model.layers.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
203
+ "model.layers.layers.26.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
204
+ "model.layers.layers.26.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
205
+ "model.layers.layers.26.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
206
+ "model.layers.layers.26.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
207
+ "model.layers.layers.26.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
208
+ "model.layers.layers.27.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
209
+ "model.layers.layers.27.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
210
+ "model.layers.layers.27.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
211
+ "model.layers.layers.27.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
212
+ "model.layers.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
213
+ "model.layers.layers.27.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
214
+ "model.layers.layers.27.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
215
+ "model.layers.layers.27.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
216
+ "model.layers.layers.27.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
217
+ "model.layers.layers.27.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
218
+ "model.layers.layers.28.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
219
+ "model.layers.layers.28.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
220
+ "model.layers.layers.28.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
221
+ "model.layers.layers.28.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
222
+ "model.layers.layers.28.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
223
+ "model.layers.layers.28.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
224
+ "model.layers.layers.28.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
225
+ "model.layers.layers.28.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
226
+ "model.layers.layers.28.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
227
+ "model.layers.layers.28.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
228
+ "model.layers.layers.29.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
229
+ "model.layers.layers.29.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
230
+ "model.layers.layers.29.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
231
+ "model.layers.layers.29.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
232
+ "model.layers.layers.29.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
233
+ "model.layers.layers.29.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
234
+ "model.layers.layers.29.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
235
+ "model.layers.layers.29.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
236
+ "model.layers.layers.29.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
237
+ "model.layers.layers.29.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
238
+ "model.layers.layers.3.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
239
+ "model.layers.layers.3.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
240
+ "model.layers.layers.3.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
241
+ "model.layers.layers.3.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
242
+ "model.layers.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
243
+ "model.layers.layers.3.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
244
+ "model.layers.layers.3.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
245
+ "model.layers.layers.3.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
246
+ "model.layers.layers.3.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
247
+ "model.layers.layers.3.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
248
+ "model.layers.layers.30.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
249
+ "model.layers.layers.30.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
250
+ "model.layers.layers.30.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
251
+ "model.layers.layers.30.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
252
+ "model.layers.layers.30.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
253
+ "model.layers.layers.30.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
254
+ "model.layers.layers.30.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
255
+ "model.layers.layers.30.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
256
+ "model.layers.layers.30.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
257
+ "model.layers.layers.30.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
258
+ "model.layers.layers.31.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
259
+ "model.layers.layers.31.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
260
+ "model.layers.layers.31.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
261
+ "model.layers.layers.31.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
262
+ "model.layers.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
263
+ "model.layers.layers.31.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
264
+ "model.layers.layers.31.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
265
+ "model.layers.layers.31.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
266
+ "model.layers.layers.31.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
267
+ "model.layers.layers.31.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
268
+ "model.layers.layers.4.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
269
+ "model.layers.layers.4.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
270
+ "model.layers.layers.4.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
271
+ "model.layers.layers.4.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
272
+ "model.layers.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
273
+ "model.layers.layers.4.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
274
+ "model.layers.layers.4.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
275
+ "model.layers.layers.4.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
276
+ "model.layers.layers.4.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
277
+ "model.layers.layers.4.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
278
+ "model.layers.layers.5.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
279
+ "model.layers.layers.5.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
280
+ "model.layers.layers.5.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
281
+ "model.layers.layers.5.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
282
+ "model.layers.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
283
+ "model.layers.layers.5.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
284
+ "model.layers.layers.5.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
285
+ "model.layers.layers.5.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
286
+ "model.layers.layers.5.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
287
+ "model.layers.layers.5.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
288
+ "model.layers.layers.6.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
289
+ "model.layers.layers.6.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
290
+ "model.layers.layers.6.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
291
+ "model.layers.layers.6.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
292
+ "model.layers.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
293
+ "model.layers.layers.6.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
294
+ "model.layers.layers.6.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
295
+ "model.layers.layers.6.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
296
+ "model.layers.layers.6.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
297
+ "model.layers.layers.6.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
298
+ "model.layers.layers.7.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
299
+ "model.layers.layers.7.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
300
+ "model.layers.layers.7.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
301
+ "model.layers.layers.7.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
302
+ "model.layers.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
303
+ "model.layers.layers.7.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
304
+ "model.layers.layers.7.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
305
+ "model.layers.layers.7.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
306
+ "model.layers.layers.7.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
307
+ "model.layers.layers.7.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
308
+ "model.layers.layers.8.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
309
+ "model.layers.layers.8.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
310
+ "model.layers.layers.8.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
311
+ "model.layers.layers.8.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
312
+ "model.layers.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
313
+ "model.layers.layers.8.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
314
+ "model.layers.layers.8.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
315
+ "model.layers.layers.8.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
316
+ "model.layers.layers.8.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
317
+ "model.layers.layers.8.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
318
+ "model.layers.layers.9.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
319
+ "model.layers.layers.9.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
320
+ "model.layers.layers.9.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
321
+ "model.layers.layers.9.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
322
+ "model.layers.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
323
+ "model.layers.layers.9.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
324
+ "model.layers.layers.9.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
325
+ "model.layers.layers.9.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
326
+ "model.layers.layers.9.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
327
+ "model.layers.layers.9.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
328
+ "model.norm.weight": "model-00004-of-00004.safetensors"
329
+ }
330
+ }
modeling_plamo.py ADDED
@@ -0,0 +1,985 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import enum
2
+ import os
3
+ import warnings
4
+ from typing import Any, Dict, List, Literal, NamedTuple, Optional, Tuple, Union
5
+
6
+ import torch
7
+ from torch import nn
8
+ from torch.nn import functional as F
9
+ from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
10
+ from transformers.cache_utils import DynamicCache
11
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
12
+
13
+ # Check if Flash Attention should be enabled
14
+ USE_FLASH_ATTENTION_FOR_POST_TRAINING = (
15
+ os.environ.get("PLAMO3_MODELING_PLAMO_USE_FLASH_ATTENTION_FOR_POST_TRAINING", "0") == "1"
16
+ )
17
+
18
+ if USE_FLASH_ATTENTION_FOR_POST_TRAINING:
19
+ try:
20
+ from flash_attn import flash_attn_func
21
+ except ImportError:
22
+ warnings.warn(
23
+ "PLAMO3_MODELING_PLAMO_USE_FLASH_ATTENTION_FOR_POST_TRAINING is set but flash_attn is not installed. "
24
+ "Falling back to scaled_dot_product_attention. "
25
+ "Install it via `pip install flash-attn` to use Flash Attention.",
26
+ stacklevel=2,
27
+ )
28
+ USE_FLASH_ATTENTION_FOR_POST_TRAINING = False
29
+
30
+
31
+ def _swiglu(h: torch.Tensor) -> torch.Tensor:
32
+ h0, h1 = h.chunk(2, dim=-1)
33
+ return torch.nn.functional.silu(h0) * h1
34
+
35
+
36
+ class RotaryEmbedding(torch.nn.Module):
37
+ def __init__(
38
+ self, dim: int, max_position_embeddings: int = 2048, base: int = 10000, device: Optional[torch.device] = None
39
+ ) -> None:
40
+ super().__init__()
41
+
42
+ self.dim = dim
43
+ self.max_position_embeddings = max_position_embeddings
44
+ self.base = base
45
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
46
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
47
+
48
+ # Build here to make `torch.jit.trace` work.
49
+ self._set_cos_sin_cache(
50
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
51
+ )
52
+
53
+ def _set_cos_sin_cache(self, seq_len: int, device: Any, dtype: Any) -> None:
54
+ self.max_seq_len_cached = seq_len
55
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype) # type: ignore
56
+
57
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
58
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
59
+ emb = torch.cat((freqs, freqs), dim=-1)
60
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
61
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
62
+
63
+ def forward(self, x: torch.Tensor, seq_len: int) -> Tuple[torch.Tensor, torch.Tensor]:
64
+ # x: [bs, num_attention_heads, seq_len, head_size]
65
+ if seq_len > self.max_seq_len_cached:
66
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
67
+
68
+ return (
69
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
70
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
71
+ )
72
+
73
+
74
+ def _rotate_half(x: torch.Tensor) -> torch.Tensor:
75
+ """Rotates half the hidden dims of the input."""
76
+ x1 = x[..., : x.shape[-1] // 2]
77
+ x2 = x[..., x.shape[-1] // 2 :]
78
+ return torch.cat((-x2, x1), dim=-1)
79
+
80
+
81
+ def _rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
82
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
83
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
84
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
85
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
86
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
87
+ x_embed = (x * cos) + (_rotate_half(x) * sin)
88
+ return x_embed
89
+
90
+
91
+ class LinearType(str, enum.Enum):
92
+ Normal = "normal"
93
+ Fp8 = "fp8"
94
+
95
+
96
+ def is_full_attn(sliding_window_pattern: int, layer_idx: int) -> bool:
97
+ return not bool((layer_idx + 1) % sliding_window_pattern)
98
+
99
+
100
+ class Plamo3Config(PretrainedConfig): # type: ignore
101
+ model_type: str = "plamo3"
102
+
103
+ def __init__(
104
+ self,
105
+ hidden_size: int = 4096,
106
+ num_hidden_layers: int = 32,
107
+ rms_norm_eps: float = 1e-6,
108
+ tie_word_embeddings: bool = True,
109
+ # Attention
110
+ num_attention_heads: int = 32,
111
+ num_key_value_heads: int = 4,
112
+ head_dim: int = 128,
113
+ max_position_embeddings: int = 2048,
114
+ window_size: int = 2048,
115
+ sliding_window_pattern: int = 8,
116
+ rope_theta: int = 1000000,
117
+ rope_local_theta: int = 10000,
118
+ # MLP
119
+ intermediate_size: int = 13312,
120
+ # Tokenizer
121
+ vocab_size: int = 32000,
122
+ tokenizer_class: str = "Plamo3Tokenizer",
123
+ pad_token_id: Optional[int] = None,
124
+ bos_token_id: int = 1,
125
+ eos_token_id: int = 2,
126
+ # Multimodal
127
+ image_token_id: Optional[int] = None,
128
+ image_feature_size: Optional[int] = None,
129
+ image_proj_type: Literal["linear", "mlp"] = "linear",
130
+ # FP8
131
+ linear_type: LinearType = LinearType.Normal,
132
+ # Evaluation
133
+ use_cache: bool = True,
134
+ **kwargs: Any,
135
+ ) -> None:
136
+ self.max_position_embeddings = max_position_embeddings
137
+ self.hidden_size = hidden_size
138
+ self.rms_norm_eps = rms_norm_eps
139
+
140
+ self.num_hidden_layers = num_hidden_layers
141
+ self.num_attention_heads = num_attention_heads
142
+ self.head_dim = head_dim
143
+ self.num_key_value_heads = num_key_value_heads
144
+ self.window_size = window_size
145
+ self.sliding_window_pattern = sliding_window_pattern
146
+ self.rope_theta = rope_theta
147
+ self.rope_local_theta = rope_local_theta
148
+
149
+ self.intermediate_size = intermediate_size
150
+
151
+ self.vocab_size = vocab_size
152
+
153
+ self.image_token_id = image_token_id
154
+ self.image_feature_size = image_feature_size
155
+ self.image_proj_type = image_proj_type
156
+
157
+ self.linear_type = linear_type
158
+
159
+ self.use_cache = use_cache
160
+
161
+ self.interleaved_sliding_window: list[int | None] = []
162
+ for i in range(self.num_hidden_layers):
163
+ if is_full_attn(self.sliding_window_pattern, i):
164
+ self.interleaved_sliding_window.append(None)
165
+ else:
166
+ self.interleaved_sliding_window.append(self.window_size)
167
+ assert len(self.interleaved_sliding_window) == self.num_hidden_layers
168
+
169
+ super().__init__(
170
+ tokenizer_class=tokenizer_class,
171
+ pad_token_id=pad_token_id,
172
+ bos_token_id=bos_token_id,
173
+ eos_token_id=eos_token_id,
174
+ tie_word_embeddings=tie_word_embeddings,
175
+ **kwargs,
176
+ )
177
+
178
+ @property
179
+ def layer_types(self) -> list[str]:
180
+ return [
181
+ "full_attention" if sliding_window_size is None else "sliding_attention"
182
+ for sliding_window_size in self.interleaved_sliding_window
183
+ ]
184
+
185
+ @property
186
+ def layers_block_type(self) -> list[str]:
187
+ return ["attention" for i in range(self.num_hidden_layers)]
188
+
189
+ @property
190
+ def rope_local_base_freq(self) -> int:
191
+ return self.rope_local_theta
192
+
193
+
194
+ class Plamo3Cache(DynamicCache): # type: ignore
195
+ def __init__(self, config: Plamo3Config) -> None:
196
+ super().__init__()
197
+ self.config = config
198
+
199
+ def finalize(self, layer_idx: int) -> None:
200
+ full_attn = self.config.layer_types[layer_idx] == "full_attention"
201
+ if full_attn:
202
+ return
203
+
204
+ window_size = self.config.window_size
205
+ assert self[layer_idx] is not None
206
+ key, value = self[layer_idx]
207
+ self.layers[layer_idx].keys = key[:, :, -window_size:, :]
208
+ self.layers[layer_idx].values = value[:, :, -window_size:, :]
209
+
210
+ def get_seq_length(self, layer_idx: Optional[int] = None) -> int:
211
+ if layer_idx is not None:
212
+ k, _ = self[layer_idx]
213
+ return k.shape[2] # type: ignore
214
+
215
+ sequence_length: int | None = None
216
+ for layer_cache in iter(self):
217
+ key = layer_cache[0]
218
+ sequence_length = max(key.shape[2], sequence_length) if sequence_length is not None else key.shape[2]
219
+ if sequence_length is None:
220
+ return 0
221
+ return sequence_length
222
+
223
+
224
+ class DecoderInput(NamedTuple):
225
+ hidden_states: torch.Tensor
226
+ attention_mask: Optional[torch.Tensor] = None
227
+ past_states: Optional[Plamo3Cache] = None
228
+ output_hidden_states: Optional[bool] = False
229
+ output_attentions: Optional[bool] = False
230
+ gradient_checkpointing: bool = False
231
+ input_ids: Optional[torch.Tensor] = None
232
+
233
+
234
+ class DecoderOutput(NamedTuple):
235
+ hidden_states: torch.Tensor
236
+ all_hidden_states: Optional[Tuple[torch.Tensor, ...]]
237
+ all_self_attns: Optional[Tuple[torch.Tensor, ...]]
238
+
239
+
240
+ def _make_causal_mask(
241
+ input_ids_shape: Tuple[int, int],
242
+ dtype: torch.dtype,
243
+ device: torch.device,
244
+ seq_len: int,
245
+ cache_position: torch.Tensor,
246
+ ) -> torch.Tensor:
247
+ """
248
+ Make causal mask used for bi-directional self-attention.
249
+
250
+ Follows the logic in `LlamaModel._prepare_4d_causal_attention_mask_with_cache_position`
251
+ https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L664
252
+
253
+ NOTE(murai): seq_len (sequence_length) and tgt_len(target_length) are swapped in the original code.
254
+ Our implementation:
255
+ - seq_len: the length of the sequences which is being processed as well as which have been processed
256
+ - tgt_len: the length of the sequences which is being processed
257
+
258
+ Original (Llama) implementation:
259
+ - sequence_length: "The sequence length being processed"
260
+ - target_length: "when generating with static cache, the mask should be as long as the static cache,
261
+ to account for the 0 padding, the part of the cache that is not filled yet."
262
+ """
263
+ bsz, tgt_len = input_ids_shape
264
+
265
+ mask = torch.full((tgt_len, seq_len), float("-inf"), device=device)
266
+ if tgt_len != 1:
267
+ # TODO(murai): is this necessary?
268
+ mask = torch.triu(mask, diagonal=1)
269
+ mask = torch.where(torch.arange(seq_len, device=device) > cache_position.reshape(-1, 1), mask, 0.0)
270
+ mask = mask.to(dtype)
271
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, seq_len)
272
+
273
+
274
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
275
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None) -> torch.Tensor:
276
+ """
277
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
278
+ """
279
+ bsz, src_len = mask.size()
280
+ tgt_len = tgt_len if tgt_len is not None else src_len
281
+
282
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
283
+
284
+ inverted_mask = 1.0 - expanded_mask
285
+
286
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), float("-inf")) # type: ignore
287
+
288
+
289
+ def _rms_norm(
290
+ hidden_states: torch.Tensor, weight: Optional[torch.Tensor], eps: float, offset: float = 1.0
291
+ ) -> torch.Tensor:
292
+ input_dtype = hidden_states.dtype
293
+ hidden_states = hidden_states.to(torch.float32)
294
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
295
+ hidden_states = hidden_states * torch.rsqrt(variance + eps)
296
+ hidden_states = hidden_states.to(input_dtype)
297
+ if weight is not None:
298
+ hidden_states = (offset + weight) * hidden_states
299
+ return hidden_states
300
+
301
+
302
+ class RMSNorm(nn.Module):
303
+ def __init__(
304
+ self,
305
+ hidden_size: int,
306
+ eps: float = 1e-6,
307
+ offset: float = 1.0,
308
+ device: Optional[Union[torch.device, str]] = None,
309
+ ) -> None:
310
+ super().__init__()
311
+ self.weight = nn.Parameter(torch.zeros(hidden_size, device=device))
312
+ self.variance_epsilon = eps
313
+ self.offset = offset
314
+
315
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
316
+ return _rms_norm(hidden_states, self.weight, self.variance_epsilon, offset=self.offset)
317
+
318
+
319
+ def swa_mask(q_len: int, kv_len: int, device: torch.device, window_size: int) -> torch.Tensor:
320
+ max_len = max(q_len, kv_len)
321
+ mask = (
322
+ torch.ones(max_len, max_len, dtype=torch.bool, device=device)
323
+ .triu(diagonal=-window_size)
324
+ .tril(diagonal=window_size)
325
+ )
326
+ return mask[-q_len:, -kv_len:]
327
+
328
+
329
+ class Attention(torch.nn.Module):
330
+ def __init__(self, config: Plamo3Config, layer_idx: int) -> None:
331
+ super().__init__()
332
+ self.config = config
333
+ self.layer_idx = layer_idx
334
+ self.hidden_size = config.hidden_size
335
+ head_dim = config.head_dim
336
+ self.max_position_embeddings = config.max_position_embeddings
337
+
338
+ self.q_num_heads = config.num_attention_heads
339
+ self.qk_dim = self.v_dim = head_dim
340
+ self.k_num_heads = self.v_num_heads = config.num_key_value_heads
341
+ assert self.q_num_heads % self.k_num_heads == 0
342
+ self.n_group = self.q_num_heads // self.k_num_heads
343
+
344
+ self.q_proj_dim = self.q_num_heads * self.qk_dim
345
+ self.k_proj_dim = self.k_num_heads * self.qk_dim
346
+ self.v_proj_dim = self.v_num_heads * self.v_dim
347
+ self.qkv_proj = nn.Linear(self.hidden_size, self.q_proj_dim + self.k_proj_dim + self.v_proj_dim, bias=False)
348
+ self.o_proj = nn.Linear(self.q_num_heads * self.v_dim, self.hidden_size, bias=False)
349
+
350
+ self.q_norm = RMSNorm(self.qk_dim, eps=self.config.rms_norm_eps, offset=1.0)
351
+ self.k_norm = RMSNorm(self.qk_dim, eps=self.config.rms_norm_eps, offset=1.0)
352
+
353
+ self.full_attn = config.layer_types[layer_idx] == "full_attention"
354
+ base = self.config.rope_theta if self.full_attn else self.config.rope_local_theta
355
+ self.rotary_emb = RotaryEmbedding(
356
+ self.qk_dim, max_position_embeddings=self.config.max_position_embeddings, base=base
357
+ )
358
+
359
+ def forward(
360
+ self,
361
+ hidden_states: torch.Tensor,
362
+ attention_mask: Optional[torch.Tensor] = None,
363
+ past_states: Optional[Plamo3Cache] = None,
364
+ output_attentions: bool = False,
365
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Plamo3Cache]]:
366
+ bsz, q_len, _ = hidden_states.size()
367
+
368
+ qkv = self.qkv_proj(hidden_states)
369
+ query_states, key_states, value_states = torch.split(
370
+ qkv, [self.q_proj_dim, self.k_proj_dim, self.v_proj_dim], dim=-1
371
+ )
372
+ query_states = query_states.view(bsz, q_len, self.q_num_heads, self.qk_dim).transpose(1, 2)
373
+ key_states = key_states.view(bsz, q_len, self.k_num_heads, self.qk_dim).transpose(1, 2)
374
+ value_states = value_states.view(bsz, q_len, self.v_num_heads, self.v_dim).transpose(1, 2)
375
+
376
+ attn_dtype = query_states.dtype
377
+
378
+ query_states = self.q_norm(query_states)
379
+ key_states = self.k_norm(key_states)
380
+
381
+ if past_states is not None:
382
+ key_states, value_states = past_states.update(key_states, value_states, self.layer_idx)
383
+ past_states.finalize(self.layer_idx)
384
+
385
+ kv_seq_len = key_states.shape[-2]
386
+ device = hidden_states.device
387
+ position_ids = torch.arange(kv_seq_len, dtype=torch.long, device=device)[None]
388
+ q_position_ids = position_ids[:, -query_states.shape[2] :]
389
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
390
+ query_states = _rotary_pos_emb(query_states, cos, sin, q_position_ids)
391
+ key_states = _rotary_pos_emb(key_states, cos, sin, position_ids)
392
+ # [bsz, nh, t, hd]
393
+
394
+ query_states = query_states.to(attn_dtype)
395
+ key_states = key_states.to(attn_dtype)
396
+ value_states = value_states.to(attn_dtype)
397
+ if attention_mask is not None and attention_mask.dtype != torch.bool:
398
+ attention_mask = attention_mask.to(attn_dtype)
399
+
400
+ if USE_FLASH_ATTENTION_FOR_POST_TRAINING:
401
+ # It is assumed that there's no padding on the left side.
402
+ # attention_mask is ignored.
403
+ if self.full_attn:
404
+ attn_output = F.scaled_dot_product_attention(
405
+ query_states, key_states, value_states, is_causal=True, enable_gqa=True
406
+ )
407
+ else:
408
+ # Use Flash Attention for sliding window attention
409
+ # Flash attention output is (N, L, H, C), transpose to (N, H, L, C) for consistency
410
+ attn_output = flash_attn_func(
411
+ query_states.transpose(1, 2),
412
+ key_states.transpose(1, 2),
413
+ value_states.transpose(1, 2),
414
+ window_size=(self.config.window_size, 0),
415
+ causal=True,
416
+ ).transpose(1, 2)
417
+ elif attention_mask is None:
418
+ assert self.full_attn or key_states.shape[2] <= self.config.window_size + 1
419
+ attn_output = F.scaled_dot_product_attention(
420
+ query_states, key_states, value_states, is_causal=True, enable_gqa=True
421
+ )
422
+ else:
423
+ if attention_mask.dtype == torch.bool:
424
+ attention_mask = torch.where(attention_mask, torch.tensor(0.0, dtype=torch.float), float("-inf"))
425
+ if len(attention_mask.shape) == 2:
426
+ attention_mask = attention_mask[None, None]
427
+ assert len(attention_mask.shape) == 4
428
+
429
+ if not self.full_attn:
430
+ m_swa = swa_mask(
431
+ query_states.shape[2], key_states.shape[2], query_states.device, self.config.window_size
432
+ )
433
+ # `generate` function creates attention mask that does not consider sliding window
434
+ m_swa = m_swa[None, None]
435
+ attention_mask = attention_mask[:, :, -query_states.shape[2] :, -key_states.shape[2] :]
436
+ attention_mask = torch.where(m_swa, attention_mask, float("-inf"))
437
+
438
+ # like AttentionMaskConverter._unmask_unattended in huggingface.transfoermers,
439
+ # we need to attend to all tokens in masked rows for `scaled_dot_product_attention`
440
+ bool_mask = torch.logical_not(torch.isneginf(attention_mask))
441
+ valid_tokens = torch.sum(bool_mask, dim=-1).bool() # (..., q_len)
442
+ attention_mask = torch.where(valid_tokens[..., None], attention_mask, float(0.0))
443
+ attn_output = F.scaled_dot_product_attention(
444
+ query_states,
445
+ key_states,
446
+ value_states,
447
+ attn_mask=attention_mask,
448
+ enable_gqa=True,
449
+ )
450
+
451
+ attn_output = attn_output.transpose(1, 2)
452
+
453
+ attn_output = attn_output.reshape(bsz, q_len, self.q_num_heads * self.v_dim)
454
+ attn_output = self.o_proj(attn_output)
455
+
456
+ if not output_attentions:
457
+ attn_weights = None
458
+
459
+ return attn_output, attn_weights, past_states
460
+
461
+
462
+ class MLP(nn.Module):
463
+ def __init__(self, config: Plamo3Config) -> None:
464
+ super().__init__()
465
+ self.config = config
466
+ self.hidden_size = config.hidden_size
467
+ self.intermediate_size = config.intermediate_size
468
+ self.gate_up_proj = torch.nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
469
+ self.down_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
470
+
471
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
472
+ h = self.gate_up_proj(x)
473
+ h = _swiglu(h)
474
+ return self.down_proj(h) # type: ignore
475
+
476
+
477
+ class Plamo3DecoderLayer(torch.nn.Module):
478
+ def __init__(self, config: Plamo3Config, layer_idx: int) -> None:
479
+ super().__init__()
480
+ self.config = config
481
+ self.hidden_size = config.hidden_size
482
+ self.mixer: torch.nn.Module
483
+ self.mixer = Attention(config, layer_idx)
484
+ self.mlp = MLP(config)
485
+ """
486
+ Notes: The model performance was degraded when setting all offsets to 1.
487
+ """
488
+ self.pre_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
489
+ self.post_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / 5)
490
+ self.pre_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
491
+ self.post_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / (5**1.5))
492
+
493
+ def forward(
494
+ self,
495
+ hidden_states: torch.Tensor,
496
+ attention_mask: Optional[torch.Tensor] = None,
497
+ past_state: Optional[Plamo3Cache] = None,
498
+ output_attentions: Optional[bool] = False,
499
+ ) -> Tuple[Any, ...]:
500
+ # from LlamaDecoder
501
+ residual = hidden_states
502
+ hidden_states = self.pre_mixer_norm(hidden_states)
503
+
504
+ # Self Attention
505
+ hidden_states_sa, self_attn_weights, present_key_value = self.mixer(
506
+ hidden_states=hidden_states,
507
+ attention_mask=attention_mask,
508
+ past_states=past_state,
509
+ output_attentions=output_attentions,
510
+ )
511
+
512
+ hidden_states_sa = self.post_mixer_norm(hidden_states_sa)
513
+ hidden_states = residual + hidden_states_sa
514
+
515
+ residual = hidden_states
516
+ hidden_states = self.pre_mlp_norm(hidden_states)
517
+
518
+ # Fully Connected
519
+ hidden_states_mlp = self.mlp(hidden_states)
520
+
521
+ # Residual
522
+ hidden_states_mlp = self.post_mlp_norm(hidden_states_mlp)
523
+ hidden_states = residual + hidden_states_mlp
524
+
525
+ outputs: Any = (hidden_states,)
526
+
527
+ if output_attentions:
528
+ outputs += (self_attn_weights,)
529
+
530
+ return outputs # type: ignore
531
+
532
+
533
+ class Plamo3Decoder(torch.nn.Module):
534
+ def __init__(self, config: Plamo3Config) -> None:
535
+ super().__init__()
536
+
537
+ self.layers = torch.nn.ModuleList(
538
+ [Plamo3DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)]
539
+ )
540
+ self.gradient_checkpointing = False
541
+
542
+ def forward(self, x: DecoderInput) -> DecoderOutput:
543
+ all_hidden_states: Optional[Tuple[torch.Tensor, ...]] = () if x.output_hidden_states else None
544
+ all_self_attns: Optional[Tuple[torch.Tensor, ...]] = () if x.output_attentions else None
545
+ hidden_states = x.hidden_states
546
+
547
+ for decoder_layer in self.layers:
548
+ if x.output_hidden_states:
549
+ assert all_hidden_states is not None
550
+ all_hidden_states += (hidden_states,)
551
+
552
+ if self.training and x.gradient_checkpointing:
553
+ layer_outputs = self._gradient_checkpointing_func( # type: ignore
554
+ decoder_layer.__call__,
555
+ hidden_states,
556
+ x.attention_mask,
557
+ x.past_states,
558
+ x.output_attentions,
559
+ )
560
+ else:
561
+ layer_outputs = decoder_layer(
562
+ hidden_states,
563
+ attention_mask=x.attention_mask,
564
+ past_state=x.past_states,
565
+ output_attentions=x.output_attentions,
566
+ )
567
+
568
+ hidden_states = layer_outputs[0]
569
+
570
+ if x.output_attentions:
571
+ assert layer_outputs[1] is not None
572
+ assert all_self_attns is not None
573
+ all_self_attns += (layer_outputs[1],)
574
+ return DecoderOutput(hidden_states, all_hidden_states, all_self_attns)
575
+
576
+
577
+ class Plamo3PreTrainedModel(PreTrainedModel): # type: ignore
578
+ config_class = Plamo3Config
579
+ _no_split_modules: List[str]
580
+ base_model_prefix = "model"
581
+ supports_gradient_checkpointing = True
582
+ _no_split_modules = ["PlamoDecoderLayer"]
583
+ _skip_keys_device_placement = "past_key_values"
584
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
585
+
586
+ def _init_weights(self, module: torch.nn.Module) -> None:
587
+ std = 0.02
588
+ if isinstance(module, nn.Linear):
589
+ module.weight.data.normal_(mean=0.0, std=std)
590
+ if module.bias is not None:
591
+ module.bias.data.zero_()
592
+ elif isinstance(module, nn.Embedding):
593
+ module.weight.data.normal_(mean=0.0, std=std)
594
+ if module.padding_idx is not None:
595
+ module.weight.data[module.padding_idx].zero_()
596
+
597
+
598
+ class Plamo3Model(Plamo3PreTrainedModel):
599
+ def __init__(self, config: Plamo3Config):
600
+ super().__init__(config)
601
+ self.padding_idx = config.pad_token_id
602
+ self.vocab_size = config.vocab_size
603
+
604
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
605
+ if config.image_feature_size is not None:
606
+ if config.image_proj_type == "mlp":
607
+ self.image_proj = MLPImageProjector(config) # type: ignore
608
+ elif config.image_proj_type == "linear":
609
+ self.image_proj = nn.Linear(config.image_feature_size, config.hidden_size, bias=False) # type: ignore
610
+ else:
611
+ raise ValueError(f"Unknown image_proj_type: {config.image_proj_type}")
612
+ self.layers = Plamo3Decoder(config) # type: ignore
613
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
614
+
615
+ self.gradient_checkpointing = False
616
+ # Initialize weights and apply final processing
617
+ self.post_init()
618
+
619
+ def get_input_embeddings(self) -> torch.nn.Embedding:
620
+ return self.embed_tokens
621
+
622
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
623
+ self.embed_tokens = value
624
+
625
+ def _prepare_decoder_attention_mask(
626
+ self,
627
+ attention_mask: torch.Tensor,
628
+ input_shape: Tuple[int, int],
629
+ inputs_embeds: torch.Tensor,
630
+ cache_position: torch.LongTensor,
631
+ ) -> Optional[torch.Tensor]:
632
+ # create causal mask
633
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
634
+ combined_attention_mask = _make_causal_mask(
635
+ input_shape,
636
+ inputs_embeds.dtype,
637
+ device=inputs_embeds.device,
638
+ seq_len=attention_mask.shape[-1],
639
+ cache_position=cache_position,
640
+ )
641
+ input_shape = (input_shape[0], combined_attention_mask.shape[2])
642
+
643
+ if attention_mask.dim() == 4:
644
+ # Custom 4D attention mask
645
+ expanded_attn_mask = attention_mask
646
+ else:
647
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
648
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
649
+ inputs_embeds.device
650
+ )
651
+ combined_attention_mask = (
652
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
653
+ )
654
+
655
+ return combined_attention_mask
656
+
657
+ def forward(
658
+ self,
659
+ input_ids: Optional[torch.LongTensor] = None,
660
+ attention_mask: Optional[torch.Tensor] = None,
661
+ position_ids: Optional[torch.Tensor] = None,
662
+ past_key_values: Optional[Plamo3Cache | DynamicCache] = None,
663
+ inputs_embeds: Optional[torch.Tensor] = None,
664
+ image_features: Optional[torch.Tensor] = None,
665
+ use_cache: Optional[bool] = None,
666
+ output_attentions: Optional[bool] = None,
667
+ output_hidden_states: Optional[bool] = None,
668
+ cache_position: Optional[torch.LongTensor] = None,
669
+ **kwargs: Any,
670
+ ) -> BaseModelOutputWithPast:
671
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
672
+ output_hidden_states = (
673
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
674
+ )
675
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
676
+
677
+ # retrieve input_ids and inputs_embeds
678
+ if (input_ids is None) ^ (inputs_embeds is not None):
679
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
680
+
681
+ if self.gradient_checkpointing and self.training and use_cache:
682
+ use_cache = False
683
+
684
+ if inputs_embeds is None:
685
+ inputs_embeds = self.embed_tokens(input_ids)
686
+ batch_size, seq_length, _ = inputs_embeds.shape
687
+
688
+ seq_length_with_past = seq_length
689
+ past_key_values_length = 0
690
+ if past_key_values is not None:
691
+ # In some `transformers` versions, `past_key_values` may be a `DynamicCache` object.
692
+ if not isinstance(past_key_values, Plamo3Cache):
693
+ past_key_values_prev = past_key_values
694
+ past_key_values = Plamo3Cache(self.config)
695
+ for layer_idx in range(len(past_key_values_prev)):
696
+ layer = past_key_values_prev.layers[layer_idx]
697
+ if layer.keys is not None and layer.values is not None:
698
+ past_key_values.update(layer.keys, layer.values, layer_idx=layer_idx)
699
+ assert isinstance(past_key_values, Plamo3Cache)
700
+ past_key_values_length = past_key_values.get_seq_length()
701
+ seq_length_with_past = seq_length_with_past + past_key_values_length
702
+
703
+ if cache_position is None:
704
+ cache_position = torch.arange(
705
+ past_key_values_length,
706
+ past_key_values_length + seq_length,
707
+ device=inputs_embeds.device,
708
+ ) # type: ignore
709
+
710
+ if image_features is not None:
711
+ assert self.config.image_token_id is not None
712
+ image_embeds = self.image_proj(image_features)
713
+ assert image_embeds.shape == inputs_embeds.shape, (image_embeds.shape, inputs_embeds.shape)
714
+ mask = input_ids == self.config.image_token_id
715
+ inputs_embeds[mask] = image_embeds[mask]
716
+
717
+ # embed positions
718
+ require_attn_mask = False
719
+ if not self.training or past_key_values is not None:
720
+ require_attn_mask = True
721
+ if seq_length_with_past > self.config.window_size + 1:
722
+ require_attn_mask = True
723
+ if require_attn_mask and attention_mask is None:
724
+ attention_mask = torch.ones(
725
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
726
+ )
727
+ if attention_mask is not None:
728
+ attention_mask = self._prepare_decoder_attention_mask(
729
+ attention_mask,
730
+ (batch_size, seq_length),
731
+ inputs_embeds,
732
+ cache_position, # type: ignore
733
+ )
734
+
735
+ hidden_states = inputs_embeds
736
+
737
+ if use_cache and past_key_values is None:
738
+ past_key_values = Plamo3Cache(self.config)
739
+
740
+ # decoder layers
741
+ out = self.layers(
742
+ DecoderInput(
743
+ hidden_states,
744
+ attention_mask,
745
+ past_key_values,
746
+ output_hidden_states,
747
+ output_attentions,
748
+ self.gradient_checkpointing,
749
+ )
750
+ )
751
+ assert isinstance(out, DecoderOutput)
752
+ hidden_states = out.hidden_states
753
+ all_hidden_states = out.all_hidden_states
754
+ all_self_attns = out.all_self_attns
755
+
756
+ hidden_states = self.norm(hidden_states)
757
+
758
+ # add hidden states from the last decoder layer
759
+ if output_hidden_states:
760
+ assert all_hidden_states is not None
761
+ all_hidden_states += (hidden_states,)
762
+
763
+ return BaseModelOutputWithPast(
764
+ last_hidden_state=hidden_states,
765
+ past_key_values=past_key_values,
766
+ hidden_states=all_hidden_states,
767
+ attentions=all_self_attns,
768
+ )
769
+
770
+
771
+ class Plamo3ForCausalLM(Plamo3PreTrainedModel, GenerationMixin): # type: ignore
772
+ _tied_weights_keys = ["lm_head.weight"]
773
+
774
+ # Without this, the model cannot be loaded into a meta device.
775
+ # Relevant code:
776
+ # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L4376-L4381
777
+ # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L356
778
+ # https://github.com/pytorch/pytorch/blob/v2.4.1/torch/nn/modules/module.py#L2068
779
+ _supports_param_buffer_assignment = False
780
+
781
+ def __init__(self, config: Plamo3Config) -> None:
782
+ super().__init__(config)
783
+ self.model = Plamo3Model(config)
784
+
785
+ self.vocab_size = config.vocab_size
786
+ vocab_size = ((self.vocab_size + 15) // 16) * 16
787
+ self.lm_head: torch.nn.Module = nn.Linear(config.hidden_size, vocab_size, bias=False)
788
+
789
+ # Initialize weights and apply final processing
790
+ self.post_init()
791
+
792
+ def get_input_embeddings(self) -> torch.nn.Embedding:
793
+ return self.model.embed_tokens
794
+
795
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
796
+ self.model.embed_tokens = value
797
+
798
+ def get_output_embeddings(self) -> torch.nn.Module:
799
+ return self.lm_head
800
+
801
+ def set_output_embeddings(self, new_embeddings: torch.nn.Module) -> None:
802
+ self.lm_head = new_embeddings
803
+
804
+ def set_decoder(self, decoder: Plamo3Model) -> None:
805
+ self.model = decoder
806
+
807
+ def get_decoder(self) -> Plamo3Model:
808
+ return self.model
809
+
810
+ def forward( # type: ignore
811
+ self,
812
+ input_ids: Optional[torch.LongTensor] = None,
813
+ attention_mask: Optional[torch.Tensor] = None,
814
+ position_ids: Optional[torch.Tensor] = None,
815
+ past_key_values: Optional[Plamo3Cache] = None,
816
+ inputs_embeds: Optional[torch.FloatTensor] = None,
817
+ image_features: Optional[torch.Tensor] = None,
818
+ labels: Optional[torch.LongTensor] = None,
819
+ use_cache: Optional[bool] = None,
820
+ output_attentions: Optional[bool] = None,
821
+ output_hidden_states: Optional[bool] = None,
822
+ cache_position: Optional[torch.LongTensor] = None,
823
+ logits_to_keep: int | torch.Tensor = 0,
824
+ **kwargs: Any,
825
+ ) -> CausalLMOutputWithPast:
826
+ r"""
827
+ Args:
828
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
829
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
830
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
831
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
832
+
833
+ Returns:
834
+
835
+ Example:
836
+
837
+ ```python
838
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
839
+
840
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
841
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
842
+
843
+ >>> prompt = "Hey, are you consciours? Can you talk to me?"
844
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
845
+
846
+ >>> # Generate
847
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
848
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
849
+ "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you."
850
+ ```"""
851
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
852
+ output_hidden_states = (
853
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
854
+ )
855
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
856
+ outputs = self.model(
857
+ input_ids=input_ids,
858
+ attention_mask=attention_mask,
859
+ position_ids=position_ids,
860
+ past_key_values=past_key_values,
861
+ inputs_embeds=inputs_embeds,
862
+ image_features=image_features,
863
+ use_cache=use_cache,
864
+ output_attentions=output_attentions,
865
+ output_hidden_states=output_hidden_states,
866
+ cache_position=cache_position,
867
+ **kwargs,
868
+ )
869
+
870
+ hidden_states = outputs[0]
871
+ logits = self.lm_head(hidden_states)
872
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
873
+ logits = logits[:, slice_indices, : self.vocab_size]
874
+
875
+ loss = None
876
+ if labels is not None:
877
+ if len(kwargs) > 0 and set(kwargs.keys()) != set(["ignore_index"]):
878
+ warnings.warn(
879
+ f"The following kwargs may not be supported: {', '.join(kwargs.keys())}. ",
880
+ stacklevel=2,
881
+ )
882
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
883
+
884
+ return CausalLMOutputWithPast(
885
+ loss=loss,
886
+ logits=logits,
887
+ past_key_values=outputs.past_key_values,
888
+ hidden_states=outputs.hidden_states,
889
+ attentions=outputs.attentions,
890
+ )
891
+
892
+ def prepare_inputs_for_generation(
893
+ self,
894
+ input_ids: torch.Tensor,
895
+ past_key_values: Optional[Plamo3Cache] = None,
896
+ attention_mask: Optional[torch.Tensor] = None,
897
+ inputs_embeds: Optional[torch.Tensor] = None,
898
+ image_features: Optional[torch.Tensor] = None,
899
+ **kwargs: Any,
900
+ ) -> Dict[str, Any]:
901
+ if past_key_values and all(k.keys is not None for k in past_key_values.layers):
902
+ input_ids = input_ids[:, -1:]
903
+ if image_features is not None:
904
+ image_features = image_features[:, -1:, :]
905
+
906
+ position_ids = kwargs.get("position_ids", None)
907
+ if attention_mask is not None and position_ids is None:
908
+ # create position_ids on the fly for batch generation
909
+ position_ids = attention_mask.long().cumsum(-1) - 1
910
+ position_ids.masked_fill_(attention_mask == 0, 1)
911
+ if past_key_values:
912
+ position_ids = position_ids[:, -1].unsqueeze(-1)
913
+
914
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
915
+ if inputs_embeds is not None and past_key_values is None:
916
+ model_inputs: Dict[str, Any] = {"inputs_embeds": inputs_embeds}
917
+ else:
918
+ model_inputs = {"input_ids": input_ids}
919
+
920
+ model_inputs.update(
921
+ {
922
+ "position_ids": position_ids,
923
+ "past_key_values": past_key_values,
924
+ "use_cache": kwargs.get("use_cache"),
925
+ "output_attentions": kwargs.get("output_attentions"),
926
+ "output_hidden_states": kwargs.get("output_hidden_states"),
927
+ "logits_to_keep": kwargs.get("logits_to_keep"),
928
+ "attention_mask": attention_mask,
929
+ "image_features": image_features,
930
+ }
931
+ )
932
+ return model_inputs
933
+
934
+ @staticmethod
935
+ def _reorder_cache(past_key_values: Plamo3Cache, beam_idx: torch.Tensor) -> Plamo3Cache:
936
+ past_key_values.reorder_cache(beam_idx)
937
+ return past_key_values
938
+
939
+
940
+ class MLPImageProjector(nn.Module):
941
+ def __init__(self, config: Plamo3Config) -> None:
942
+ super().__init__()
943
+ self.config = config
944
+
945
+ assert config.image_feature_size is not None # for typing
946
+
947
+ # nn.LayerNorm is not supported by PFVM, so use RMSNorm + Bias instead to approximate this.
948
+ self.norm0 = RMSNorm(config.image_feature_size, eps=config.rms_norm_eps)
949
+ self.bias0 = Bias(config.image_feature_size)
950
+
951
+ # PFVM doesn't support Linear with bias, so add bias manually afterwards.
952
+ self.linear1 = nn.Linear(config.image_feature_size, config.hidden_size, bias=False)
953
+ self.bias1 = Bias(config.hidden_size)
954
+ self.act1 = nn.GELU()
955
+
956
+ self.linear2 = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
957
+ self.bias2 = Bias(config.hidden_size)
958
+
959
+ def forward(
960
+ self,
961
+ hidden_states: torch.Tensor,
962
+ ) -> torch.Tensor:
963
+ hidden_states = self.norm0(hidden_states)
964
+ hidden_states = self.bias0(hidden_states)
965
+
966
+ hidden_states = self.linear1(hidden_states)
967
+ hidden_states = self.bias1(hidden_states)
968
+ hidden_states = self.act1(hidden_states)
969
+
970
+ hidden_states = self.linear2(hidden_states)
971
+ hidden_states = self.bias2(hidden_states)
972
+
973
+ return hidden_states
974
+
975
+
976
+ class Bias(nn.Module):
977
+ def __init__(self, num_features: int) -> None:
978
+ super().__init__()
979
+ self._bias = nn.Parameter(torch.zeros((num_features,)))
980
+
981
+ def forward(
982
+ self,
983
+ x: torch.Tensor,
984
+ ) -> torch.Tensor:
985
+ return x + self._bias
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<|plamo:bos|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|plamo:eos|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|plamo:pad|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|plamo:unk|>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenization_plamo.py ADDED
@@ -0,0 +1,464 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import math
3
+ import os
4
+ import re
5
+ from shutil import copyfile
6
+ from typing import Any
7
+
8
+ import numpy as np
9
+
10
+ # NOTE: numba does not support type hints for njit: https://github.com/python/mypy/issues/16149
11
+ from numba import njit # type: ignore[attr-defined]
12
+ from numba.core import types # type: ignore[import-untyped]
13
+ from numba.typed import Dict
14
+ from transformers.tokenization_utils import PreTrainedTokenizer
15
+ from transformers.utils import logging
16
+
17
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.jsonl"}
18
+ logger = logging.get_logger(__name__)
19
+
20
+ INVALID_SCORE = -20000000
21
+ UNKNOWN_SCORE = -10000000
22
+
23
+ TABLE_PIECE_LENGTH = 0
24
+ TABLE_TOKEN_ID = 1
25
+ TABLE_SCORE = 2
26
+ TABLE_PIECE_ID = 3
27
+
28
+ PATH_TOKEN_LENGTH = 0
29
+ PATH_TOKEN_ID = 1
30
+ PATH_NUM_TOKENS = 2
31
+
32
+ # In Unicode, the character U+EE00 is a private use character that is not assigned to any specific
33
+ # character. This is internally used as a boundary character in tokenization.
34
+ BOUNDARY_CHAR = "\uee00"
35
+ BOUNDARY_TOKEN_ID = 10000000
36
+
37
+
38
+ class AhoCorasick:
39
+ def __init__(self) -> None:
40
+ # List of tokens in the vocabulary.
41
+ self._tokens: list[str]
42
+
43
+ # A mapping from a byte code point to a token ID, used for byte fallback.
44
+ self._bytes: np.ndarray
45
+
46
+ # A mapping from a suffix's piece code to a suffix ID.
47
+ #
48
+ # Typically, the Aho-Corasick algorithm builds a Trie and adds suffix links between nodes
49
+ # of the Trie. In this implementation, a suffix ID corresponds to a node in the trie, and
50
+ # a piece code to an edge (in other words, a pair of a node and the next character).
51
+ #
52
+ # A piece code is a 64-bit integer:
53
+ # - The upper 32 bits store the Unicode code point of the first character.
54
+ # - The lower 32 bits store the suffix ID of the remaining suffix.
55
+ #
56
+ # A suffix ID is an integer indicating the starting position in the _table.
57
+ self._to_suffix_id: dict[np.int64, np.int32] # numba.typed.Dict if jit is enabled
58
+
59
+ # Flattened table representing the Trie structure for the Aho-Corasick algorithm.
60
+ # It stores information including scores for each piece (prefix) within each suffix.
61
+ # It is flattened for memory efficiency and performance. Suffixes are stored in
62
+ # lexicographical order of their reversed strings, which improves memory access locality
63
+ # when exploring new characters starting from the string's end. Pieces within a suffix are
64
+ # stored in the decreasing order of their lengths.
65
+ #
66
+ # Each piece (a prefix fo the suffix) contains four pieces of information:
67
+ # - TABLE_PIECE_LENGTH: Length of the piece.
68
+ # - TABLE_TOKEN_ID: Token ID (or -1 if the piece is not a valid token).
69
+ # - TABLE_SCORE: Score (or INVALID_SCORE if the piece is not a valid token).
70
+ # - TABLE_PIECE_ID: Piece ID of the suffix.
71
+ #
72
+ # Each suffix also includes a sentinel row with a length of 1, a score of UNKNOWN_SCORE,
73
+ # and a token ID of -1. Sentinel rows are identified by the score being UNKNOWN_SCORE.
74
+ self._table: np.ndarray
75
+
76
+ # Regular expression matcher for identifying special tokens in the format <|plamo:*|>.
77
+ # Used to split text around special tokens during tokenization preprocessing.
78
+ self._sp_token_matcher: re.Pattern[str] | None = None
79
+
80
+ # Preprocessor to prevent boundary shifts in Unigram tokenization.
81
+ # The global DP in Unigram can create long lookahead dependencies, causing token boundaries
82
+ # to shift unexpectedly based on later context. While various sequences can trigger this,
83
+ # the most common culprits are long runs of spaces or repeated characters. This matcher
84
+ # finds sequences of two or more spaces or any character repeated four or more times, and
85
+ # forces hard splits immediately before and after each match, treating the span as its own
86
+ # token. By explicitly marking these boundaries, we eliminate most boundary jitter without
87
+ # trying to cover every rare case.
88
+ self._matcher: re.Pattern[str] | None = None
89
+
90
+ def build(
91
+ self,
92
+ vocab: list[Any],
93
+ *,
94
+ break_around_consecutive_spaces_threshold: int | None = None,
95
+ break_around_repeated_chars_threshold: int | None = None,
96
+ ) -> None:
97
+ """Build the Aho-Corasick data structure from vocabulary.
98
+
99
+ Args:
100
+ vocab: List of vocabulary entries, where each entry is [token, score, type, ...].
101
+ break_around_consecutive_spaces_threshold: Minimum number of consecutive spaces to trigger boundary splits.
102
+ If None, consecutive spaces won't trigger splits.
103
+ break_around_repeated_chars_threshold: Minimum number of repeated characters to trigger boundary splits.
104
+ If None, repeated characters won't trigger splits.
105
+ """
106
+ self._bytes = np.zeros(256, dtype=np.int32)
107
+ self._to_suffix_id = Dict.empty(key_type=types.int64, value_type=types.int32)
108
+
109
+ # Build suffix_to_score and token_to_token_id.
110
+ # The suffix_to_score dictionary maps a suffix to its score. It also includes all suffixes
111
+ # of the token for the Trie structure for the Aho-Corasick algorithm. If a suffix is not a
112
+ # valid token, its score is set to math.nan.
113
+ # The token_to_token_id dictionary maps a token to its token ID.
114
+ suffix_to_score: dict[str, float] = {}
115
+ token_to_token_id: dict[str, int] = {}
116
+ self._tokens = []
117
+ for token_id, row in list(enumerate(vocab)) + [(BOUNDARY_TOKEN_ID, [BOUNDARY_CHAR, 0, "CONTROL"])]:
118
+ assert isinstance(row[0], str), row
119
+ assert isinstance(row[1], (int, float)), row
120
+
121
+ token = str(row[0])
122
+ self._tokens.append(token)
123
+ token_to_token_id[token] = token_id
124
+
125
+ # Special handling for byte tokens.
126
+ if len(row) > 2 and row[2] == "BYTE":
127
+ assert len(token) == 6 and token.startswith("<0x") and token.endswith(">"), row[0]
128
+ self._bytes[int(row[0][3:5], 16)] = token_id
129
+ continue
130
+
131
+ suffix_to_score[token] = float(row[1])
132
+ # Ensure that all suffixes are included in suffix_to_score.
133
+ for i in range(1, len(token)):
134
+ suffix_to_score[token[i:]] = suffix_to_score.get(token[i:], math.nan)
135
+
136
+ # Ensure all byte tokens are set.
137
+ for i in range(256):
138
+ assert self._bytes[i] != 0, f"Byte token for <0x{i:02X}> is not set."
139
+
140
+ # Build a matcher for special tokens.
141
+ self._sp_token_matcher = re.compile(r"(<\|plamo:[^|\s]{,64}\|>)")
142
+
143
+ # Build matcher pattern to prevent boundary shifts.
144
+ patterns = []
145
+ if break_around_repeated_chars_threshold is not None:
146
+ patterns.append(f"(.)\\2{{{break_around_repeated_chars_threshold - 1},}}")
147
+ if break_around_consecutive_spaces_threshold is not None:
148
+ patterns.append(f" {{{break_around_consecutive_spaces_threshold},}}")
149
+ self._matcher = re.compile(f"({'|'.join(patterns)})") if patterns else None
150
+
151
+ # List suffixes in lexicographical order of their reversed strings.
152
+ suffixes = list(suffix_to_score.keys())
153
+ suffixes.append("")
154
+ suffixes.sort(key=lambda x: x[::-1])
155
+
156
+ # Build suffix_to_id, which is a mapping from a suffix to a suffix ID, and _to_suffix_id,
157
+ # which is a mapping from a piece code to a suffix ID.
158
+ suffix_to_id: dict[str, int] = {}
159
+ num_pieces = 0
160
+ for s in suffixes:
161
+ suffix_to_id[s] = num_pieces
162
+ if s != "":
163
+ self._to_suffix_id[
164
+ ord(s[0]) << 32 | suffix_to_id[s[1:]] # type: ignore[index] # cast int to np.int64
165
+ ] = np.int32(num_pieces)
166
+ num_pieces += 1 + sum(s[:i] in suffix_to_score for i in range(1, len(s) + 1))
167
+ assert suffix_to_id[""] == 0, suffix_to_id[""]
168
+
169
+ # Build _table, which is a flattened table representing the Trie structure for the Aho-Corasick.
170
+ self._table = np.zeros((num_pieces, 4), dtype=np.int32)
171
+ i = 0
172
+ for suffix in suffixes:
173
+ # Add all prefixes of the suffix to the table.
174
+ for piece_length in range(len(suffix), 0, -1):
175
+ piece = suffix[:piece_length]
176
+ score = suffix_to_score.get(piece, None)
177
+ if score is None:
178
+ continue
179
+ self._table[i, TABLE_PIECE_LENGTH] = piece_length
180
+ self._table[i, TABLE_TOKEN_ID] = token_to_token_id.get(piece, -1)
181
+ self._table[i, TABLE_SCORE] = round(score * 1e4) if math.isfinite(score) else INVALID_SCORE
182
+ self._table[i, TABLE_PIECE_ID] = suffix_to_id[piece]
183
+ i += 1
184
+
185
+ # Add a sentinel row.
186
+ self._table[i, TABLE_PIECE_LENGTH] = 1
187
+ self._table[i, TABLE_TOKEN_ID] = -1
188
+ self._table[i, TABLE_SCORE] = UNKNOWN_SCORE
189
+ i += 1
190
+ assert i == num_pieces, (i, num_pieces)
191
+
192
+ @staticmethod
193
+ @njit # type: ignore[misc] # untyped decorator
194
+ def _encode(
195
+ to_suffix_id: dict[np.int64, np.int32], # numba.typed.Dict if jit is enabled
196
+ table: np.ndarray,
197
+ bytes: np.ndarray,
198
+ data: np.ndarray,
199
+ ) -> np.ndarray:
200
+ # Initialize scores array with a high value and set the score at the end to 0.
201
+ # This array keeps track of the minimum cost (best score) to encode from each position to the end.
202
+ scores = np.full((len(data) + 1,), 2**60, dtype=np.int64)
203
+ scores[-1] = 0
204
+
205
+ # Path array to store the best path information.
206
+ # The path array keeps track of token length, token ID, and number of tokens needed to encode.
207
+ path = np.zeros((len(data) + 1, 3), dtype=np.int32)
208
+
209
+ # Initialize suffix_id to 0, which represents the root of the Trie.
210
+ suffix_id = np.int32(0)
211
+
212
+ # Process the input data from the end to the beginning.
213
+ for i in range(len(data) - 1, -1, -1):
214
+ c: np.int32 = data[i]
215
+
216
+ # Find the next suffix ID by iterating the suffix IDs of prefixes of the current suffix.
217
+ # NOTE: If no suffix ID is found, suffix_id will be set to 0.
218
+ for p in range(suffix_id, len(table)):
219
+ suffix_id = to_suffix_id.get(np.int64(c) << 32 | table[p, TABLE_PIECE_ID], np.int32(0))
220
+ # If a next suffix ID is found or a sentinel row is reached, break the loop.
221
+ if suffix_id > 0 or table[p, TABLE_SCORE] == UNKNOWN_SCORE:
222
+ break
223
+
224
+ # Update the best path to the current position. If multiple paths have the same score,
225
+ # this chooses the longest prefix as the best path (table is sorted in the decreasing
226
+ # order of piece length).
227
+ for p in range(suffix_id, len(table)):
228
+ score = table[p, TABLE_SCORE]
229
+ if score > INVALID_SCORE:
230
+ piece_length = table[p, TABLE_PIECE_LENGTH]
231
+ s = scores[i + piece_length] - score
232
+ if s < scores[i]:
233
+ scores[i] = s
234
+ path[i, PATH_TOKEN_LENGTH] = piece_length
235
+ path[i, PATH_TOKEN_ID] = table[p, TABLE_TOKEN_ID]
236
+ path[i, PATH_NUM_TOKENS] = path[i + piece_length, PATH_NUM_TOKENS] + 1
237
+ if score == UNKNOWN_SCORE:
238
+ # Add number of bytes to represent `c` in UTF-8 (minus 1; 1 is already
239
+ # added above).
240
+ path[i, PATH_NUM_TOKENS] += (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
241
+
242
+ # If it reaches a sentinel row, break the loop.
243
+ if score == UNKNOWN_SCORE:
244
+ break
245
+
246
+ # Decode the best path from the beginning to get the token IDs.
247
+ pos = 0
248
+ token_ids = np.zeros(path[0, PATH_NUM_TOKENS], dtype=np.int32)
249
+ token_pos = 0
250
+ while pos < len(data):
251
+ if path[pos, PATH_TOKEN_ID] >= 0:
252
+ token_ids[token_pos] = path[pos, PATH_TOKEN_ID]
253
+ if token_ids[token_pos] != BOUNDARY_TOKEN_ID:
254
+ token_pos += 1
255
+ else:
256
+ # Fall back to byte tokens.
257
+ c = data[pos]
258
+ s = 1 + (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
259
+ # Add byte tokens representing UTF-8 bytes.
260
+ for i in range(s):
261
+ b = c if s == 1 else (0xF00 >> s) & 0xFF if i == 0 else 0x80
262
+ token_ids[token_pos] = bytes[b | ((c >> (s - i - 1) * 6) & 0x3F)]
263
+ token_pos += 1
264
+
265
+ # Ensure that pos should increase by at least 1.
266
+ assert path[pos, PATH_TOKEN_LENGTH] > 0, (pos, path[pos])
267
+ pos += path[pos, PATH_TOKEN_LENGTH]
268
+
269
+ return token_ids[:token_pos]
270
+
271
+ def encode(self, data: str) -> np.ndarray:
272
+ """Encodes a string into a sequence of token IDs."""
273
+ if self._sp_token_matcher is not None:
274
+ data = self._sp_token_matcher.sub(BOUNDARY_CHAR + "\\1" + BOUNDARY_CHAR, data)
275
+ if self._matcher is not None:
276
+ data = self._matcher.sub(BOUNDARY_CHAR + "\\1" + BOUNDARY_CHAR, data)
277
+ return np.asarray(
278
+ self._encode(
279
+ self._to_suffix_id,
280
+ self._table,
281
+ self._bytes,
282
+ # Convert a string into a numpy array of Unicode code points.
283
+ # NOTE: This skips UTF-32 BOM.
284
+ np.frombuffer(data.encode("utf-32"), dtype=np.int32)[1:],
285
+ )
286
+ )
287
+
288
+ def encode_as_tokens(self, data: str) -> list[str]:
289
+ """Encodes a string into a sequence of tokens."""
290
+ return [self._tokens[token_id] for token_id in self.encode(data)]
291
+
292
+
293
+ class Plamo3Tokenizer(PreTrainedTokenizer):
294
+ vocab_files_names = VOCAB_FILES_NAMES
295
+ model_input_names = ["input_ids", "attention_mask"]
296
+
297
+ _save_files = [
298
+ "special_tokens_map.json",
299
+ "tokenization_plamo.py",
300
+ "tokenizer.jsonl",
301
+ "tokenizer_config.json",
302
+ ]
303
+
304
+ def __init__(
305
+ self,
306
+ vocab_file: str,
307
+ unk_token: str = "<|plamo:unk|>",
308
+ bos_token: str = "<|plamo:bos|>",
309
+ eos_token: str = "<|plamo:eos|>",
310
+ pad_token: str = "<|plamo:pad|>",
311
+ cls_token: str | None = None,
312
+ sep_token: str | None = None,
313
+ mask_token: str | None = None,
314
+ clean_up_tokenization_spaces: bool = False,
315
+ break_around_consecutive_spaces_threshold: int | None = None,
316
+ break_around_repeated_chars_threshold: int | None = None,
317
+ **kwargs: Any,
318
+ ) -> None:
319
+ """Tokenizer for PLaMo.
320
+
321
+ Args:
322
+ vocab_file (str): Vocabrary file path.
323
+ unk_token (str): Unknown token.
324
+ bos_token (str): Beginning of sentence token.
325
+ eos_token (str): End of sentence token.
326
+ pad_token (str): Padding token.
327
+ cls_token (str):
328
+ Classification token, to extract a summary of an input sequence leveraging self-attention along the
329
+ full depth of the model.
330
+ sep_token (str): Separation token, to separate context and query in an input sequence.
331
+ mask_token (str): Mask token, to use when training a model with masked-language modeling.
332
+ clean_up_tokenization_spaces (bool): Whether or not to clean up the tokenization spaces.
333
+ break_around_consecutive_spaces_threshold (int, optional): Minimum number of consecutive spaces to trigger
334
+ boundary splits. If None, consecutive spaces won't trigger splits.
335
+ break_around_repeated_chars_threshold (int, optional): Minimum number of repeated characters to trigger
336
+ boundary splits. If None, repeated characters won't trigger splits.
337
+ num_threads (int):
338
+ Number of threads. This value will be ignored if one of `PLAMO_TOKENIZER_NUM_THREADS` or
339
+ `RAYON_NUM_THREADS` is set as an environment variable.
340
+ """
341
+ if "add_bos_token" not in kwargs:
342
+ kwargs["add_bos_token"] = False
343
+ if "add_eos_token" not in kwargs:
344
+ kwargs["add_eos_token"] = False
345
+ with open(vocab_file, encoding="utf-8") as f:
346
+ self.data: list[Any] = [json.loads(line) for line in f]
347
+ self.vocab: dict[str, int] = {v[0]: i for i, v in enumerate(self.data)}
348
+ self.aho_corasick = AhoCorasick()
349
+ self.break_around_consecutive_spaces_threshold = break_around_consecutive_spaces_threshold
350
+ self.break_around_repeated_chars_threshold = break_around_repeated_chars_threshold
351
+ self.aho_corasick.build(
352
+ self.data,
353
+ break_around_consecutive_spaces_threshold=self.break_around_consecutive_spaces_threshold,
354
+ break_around_repeated_chars_threshold=self.break_around_repeated_chars_threshold,
355
+ )
356
+ self.vocab_file = vocab_file
357
+ self.add_bos_token = kwargs["add_bos_token"]
358
+ self.add_eos_token = kwargs["add_eos_token"]
359
+
360
+ super().__init__( # type: ignore[no-untyped-call]
361
+ vocab_file=vocab_file,
362
+ unk_token=unk_token,
363
+ bos_token=bos_token,
364
+ eos_token=eos_token,
365
+ pad_token=pad_token,
366
+ cls_token=cls_token,
367
+ sep_token=sep_token,
368
+ mask_token=mask_token,
369
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
370
+ break_around_consecutive_spaces_threshold=break_around_consecutive_spaces_threshold,
371
+ break_around_repeated_chars_threshold=break_around_repeated_chars_threshold,
372
+ **kwargs,
373
+ )
374
+
375
+ # the functions below are copied from hf transformers LlamaTokenizer's implementation to fix the behaviour of the tokenizer
376
+ # https://github.com/huggingface/transformers/blob/v4.30.2/src/transformers/models/llama/tokenization_llama.py
377
+
378
+ def __getstate__(self) -> dict[str, Any]:
379
+ state = self.__dict__.copy()
380
+ state["aho_corasick"] = None
381
+ return state
382
+
383
+ def __setstate__(self, d: dict[str, Any]) -> None:
384
+ self.__dict__ = d
385
+ self.aho_corasick = AhoCorasick()
386
+ self.aho_corasick.build(
387
+ self.data,
388
+ break_around_consecutive_spaces_threshold=self.break_around_consecutive_spaces_threshold,
389
+ break_around_repeated_chars_threshold=self.break_around_repeated_chars_threshold,
390
+ )
391
+
392
+ @property
393
+ def vocab_size(self) -> Any:
394
+ """Returns vocab size"""
395
+ return len(self.data)
396
+
397
+ def token_to_score(self, token: str) -> float | None:
398
+ """Returns score of the token"""
399
+ token_id = self.vocab.get(token, None)
400
+ return None if token_id is None else self.data[token_id][1]
401
+
402
+ def get_vocab(self) -> dict[str, int]:
403
+ """Returns vocab as a dict"""
404
+ vocab = self.vocab.copy()
405
+ vocab.update(self.added_tokens_encoder)
406
+ return vocab
407
+
408
+ def convert_tokens_to_string(self, tokens: list[str]) -> str:
409
+ """Converts a sequence of tokens (string) in a single string."""
410
+ return b"".join(
411
+ [bytes([int(t[3:5], 16)]) if t.startswith("<0x") else t.encode("utf-8") for t in tokens]
412
+ ).decode("utf-8", errors="replace")
413
+
414
+ def _tokenize(self, text: str, **kwargs: Any) -> list[str]:
415
+ """Returns a tokenized string."""
416
+ return self.aho_corasick.encode_as_tokens(text)
417
+
418
+ def _convert_token_to_id(self, token: str) -> int:
419
+ """Converts a token (str) in an id using the vocab."""
420
+ return self.vocab.get(token, 0)
421
+
422
+ def _convert_id_to_token(self, index: int) -> str:
423
+ """Converts an index (integer) in a token (str) using the vocab."""
424
+ return self.data[index][0] # type: ignore[no-any-return]
425
+
426
+ def build_inputs_with_special_tokens(
427
+ self, token_ids_0: list[int], token_ids_1: list[int] | None = None
428
+ ) -> list[int]:
429
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
430
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
431
+
432
+ output = bos_token_id + token_ids_0 + eos_token_id
433
+
434
+ if token_ids_1 is not None:
435
+ output = output + bos_token_id + token_ids_1 + eos_token_id
436
+
437
+ return output
438
+
439
+ def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None) -> tuple[str]:
440
+ """
441
+ Save the vocabulary and special tokens file to a directory.
442
+
443
+ Args:
444
+ save_directory (`str`):
445
+ The directory in which to save the vocabulary.
446
+
447
+ Returns:
448
+ `Tuple(str)`: Paths to the files saved.
449
+ """
450
+ if not os.path.isdir(save_directory):
451
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
452
+ return ("",)
453
+ out_vocab_file = os.path.join(
454
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
455
+ )
456
+
457
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
458
+ copyfile(self.vocab_file, out_vocab_file)
459
+ elif not os.path.isfile(self.vocab_file):
460
+ with open(out_vocab_file, "w") as f:
461
+ for token in self.data:
462
+ print(json.dumps(token, ensure_ascii=False), file=f)
463
+
464
+ return (out_vocab_file,)
tokenizer.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<|plamo:unk|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<|plamo:bos|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<|plamo:eos|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<|plamo:pad|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ }
37
+ },
38
+ "auto_map": {
39
+ "AutoTokenizer": [
40
+ "tokenization_plamo.Plamo3Tokenizer",
41
+ null
42
+ ]
43
+ },
44
+ "bos_token": "<|plamo:bos|>",
45
+ "break_around_consecutive_spaces_threshold": 2,
46
+ "break_around_repeated_chars_threshold": 4,
47
+ "clean_up_tokenization_spaces": false,
48
+ "cls_token": null,
49
+ "eos_token": "<|plamo:eos|>",
50
+ "extra_special_tokens": {},
51
+ "local_file_only": true,
52
+ "mask_token": null,
53
+ "model_max_length": 1000000000000000019884624838656,
54
+ "pad_token": "<|plamo:pad|>",
55
+ "padding_side": "right",
56
+ "sep_token": null,
57
+ "split_special_tokens": false,
58
+ "tokenizer_class": "Plamo3Tokenizer",
59
+ "unk_token": "<|plamo:unk|>"
60
+ }
train_results.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "epoch": 2.0,
3
+ "total_flos": 93553682546688.0,
4
+ "train_loss": 0.9335812793004201,
5
+ "train_runtime": 7476.0796,
6
+ "train_samples_per_second": 17.685,
7
+ "train_steps_per_second": 0.276
8
+ }
trainer_log.jsonl ADDED
@@ -0,0 +1,228 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {"current_steps": 10, "total_steps": 2066, "loss": 1.7236, "lr": 7.258064516129033e-07, "epoch": 0.00968054211035818, "percentage": 0.48, "elapsed_time": "0:00:36", "remaining_time": "2:04:00"}
2
+ {"current_steps": 20, "total_steps": 2066, "loss": 1.627, "lr": 1.5322580645161292e-06, "epoch": 0.01936108422071636, "percentage": 0.97, "elapsed_time": "0:01:11", "remaining_time": "2:01:15"}
3
+ {"current_steps": 30, "total_steps": 2066, "loss": 1.4318, "lr": 2.338709677419355e-06, "epoch": 0.02904162633107454, "percentage": 1.45, "elapsed_time": "0:01:46", "remaining_time": "2:00:55"}
4
+ {"current_steps": 40, "total_steps": 2066, "loss": 1.3885, "lr": 3.145161290322581e-06, "epoch": 0.03872216844143272, "percentage": 1.94, "elapsed_time": "0:02:20", "remaining_time": "1:59:01"}
5
+ {"current_steps": 50, "total_steps": 2066, "loss": 1.3955, "lr": 3.951612903225807e-06, "epoch": 0.0484027105517909, "percentage": 2.42, "elapsed_time": "0:02:59", "remaining_time": "2:00:19"}
6
+ {"current_steps": 60, "total_steps": 2066, "loss": 1.2718, "lr": 4.758064516129033e-06, "epoch": 0.05808325266214908, "percentage": 2.9, "elapsed_time": "0:03:32", "remaining_time": "1:58:09"}
7
+ {"current_steps": 70, "total_steps": 2066, "loss": 1.3654, "lr": 4.999849475897687e-06, "epoch": 0.06776379477250725, "percentage": 3.39, "elapsed_time": "0:04:11", "remaining_time": "1:59:35"}
8
+ {"current_steps": 80, "total_steps": 2066, "loss": 1.2831, "lr": 4.999112258623345e-06, "epoch": 0.07744433688286544, "percentage": 3.87, "elapsed_time": "0:04:45", "remaining_time": "1:58:13"}
9
+ {"current_steps": 90, "total_steps": 2066, "loss": 1.3002, "lr": 4.997760881838323e-06, "epoch": 0.08712487899322362, "percentage": 4.36, "elapsed_time": "0:05:17", "remaining_time": "1:56:21"}
10
+ {"current_steps": 100, "total_steps": 2066, "loss": 1.287, "lr": 4.995795677644913e-06, "epoch": 0.0968054211035818, "percentage": 4.84, "elapsed_time": "0:05:51", "remaining_time": "1:55:03"}
11
+ {"current_steps": 110, "total_steps": 2066, "loss": 1.2492, "lr": 4.993217128994149e-06, "epoch": 0.10648596321393998, "percentage": 5.32, "elapsed_time": "0:07:17", "remaining_time": "2:09:35"}
12
+ {"current_steps": 120, "total_steps": 2066, "loss": 1.2794, "lr": 4.9900258695671176e-06, "epoch": 0.11616650532429816, "percentage": 5.81, "elapsed_time": "0:07:53", "remaining_time": "2:08:05"}
13
+ {"current_steps": 130, "total_steps": 2066, "loss": 1.2506, "lr": 4.986222683619237e-06, "epoch": 0.12584704743465633, "percentage": 6.29, "elapsed_time": "0:08:28", "remaining_time": "2:06:16"}
14
+ {"current_steps": 140, "total_steps": 2066, "loss": 1.2609, "lr": 4.981808505787523e-06, "epoch": 0.1355275895450145, "percentage": 6.78, "elapsed_time": "0:09:01", "remaining_time": "2:04:14"}
15
+ {"current_steps": 150, "total_steps": 2066, "loss": 1.2329, "lr": 4.976784420860898e-06, "epoch": 0.1452081316553727, "percentage": 7.26, "elapsed_time": "0:09:35", "remaining_time": "2:02:31"}
16
+ {"current_steps": 160, "total_steps": 2066, "loss": 1.3551, "lr": 4.971151663513608e-06, "epoch": 0.15488867376573087, "percentage": 7.74, "elapsed_time": "0:10:12", "remaining_time": "2:01:34"}
17
+ {"current_steps": 170, "total_steps": 2066, "loss": 1.261, "lr": 4.964911618001794e-06, "epoch": 0.16456921587608905, "percentage": 8.23, "elapsed_time": "0:10:45", "remaining_time": "1:59:54"}
18
+ {"current_steps": 180, "total_steps": 2066, "loss": 1.2055, "lr": 4.958065817823318e-06, "epoch": 0.17424975798644723, "percentage": 8.71, "elapsed_time": "0:11:18", "remaining_time": "1:58:31"}
19
+ {"current_steps": 190, "total_steps": 2066, "loss": 1.3022, "lr": 4.950615945340893e-06, "epoch": 0.18393030009680542, "percentage": 9.2, "elapsed_time": "0:11:52", "remaining_time": "1:57:15"}
20
+ {"current_steps": 200, "total_steps": 2066, "loss": 1.2701, "lr": 4.942563831368653e-06, "epoch": 0.1936108422071636, "percentage": 9.68, "elapsed_time": "0:12:26", "remaining_time": "1:56:02"}
21
+ {"current_steps": 200, "total_steps": 2066, "eval_loss": 1.3192224502563477, "epoch": 0.1936108422071636, "percentage": 9.68, "elapsed_time": "0:12:37", "remaining_time": "1:57:46"}
22
+ {"current_steps": 210, "total_steps": 2066, "loss": 1.277, "lr": 4.933911454722217e-06, "epoch": 0.20329138431752178, "percentage": 10.16, "elapsed_time": "0:14:39", "remaining_time": "2:09:33"}
23
+ {"current_steps": 220, "total_steps": 2066, "loss": 1.2418, "lr": 4.924660941732403e-06, "epoch": 0.21297192642787996, "percentage": 10.65, "elapsed_time": "0:15:16", "remaining_time": "2:08:09"}
24
+ {"current_steps": 230, "total_steps": 2066, "loss": 1.294, "lr": 4.914814565722671e-06, "epoch": 0.22265246853823814, "percentage": 11.13, "elapsed_time": "0:15:48", "remaining_time": "2:06:13"}
25
+ {"current_steps": 240, "total_steps": 2066, "loss": 1.2823, "lr": 4.9043747464504586e-06, "epoch": 0.23233301064859632, "percentage": 11.62, "elapsed_time": "0:16:28", "remaining_time": "2:05:24"}
26
+ {"current_steps": 250, "total_steps": 2066, "loss": 1.2753, "lr": 4.893344049512519e-06, "epoch": 0.2420135527589545, "percentage": 12.1, "elapsed_time": "0:17:04", "remaining_time": "2:04:02"}
27
+ {"current_steps": 260, "total_steps": 2066, "loss": 1.1851, "lr": 4.881725185714421e-06, "epoch": 0.25169409486931266, "percentage": 12.58, "elapsed_time": "0:17:37", "remaining_time": "2:02:22"}
28
+ {"current_steps": 270, "total_steps": 2066, "loss": 1.2901, "lr": 4.869521010404373e-06, "epoch": 0.26137463697967084, "percentage": 13.07, "elapsed_time": "0:18:13", "remaining_time": "2:01:16"}
29
+ {"current_steps": 280, "total_steps": 2066, "loss": 1.246, "lr": 4.856734522771512e-06, "epoch": 0.271055179090029, "percentage": 13.55, "elapsed_time": "0:18:56", "remaining_time": "2:00:47"}
30
+ {"current_steps": 290, "total_steps": 2066, "loss": 1.204, "lr": 4.843368865108847e-06, "epoch": 0.2807357212003872, "percentage": 14.04, "elapsed_time": "0:19:30", "remaining_time": "1:59:30"}
31
+ {"current_steps": 300, "total_steps": 2066, "loss": 1.271, "lr": 4.8294273220410494e-06, "epoch": 0.2904162633107454, "percentage": 14.52, "elapsed_time": "0:20:06", "remaining_time": "1:58:23"}
32
+ {"current_steps": 310, "total_steps": 2066, "loss": 1.307, "lr": 4.814913319717238e-06, "epoch": 0.30009680542110356, "percentage": 15.0, "elapsed_time": "0:28:47", "remaining_time": "2:43:06"}
33
+ {"current_steps": 320, "total_steps": 2066, "loss": 1.273, "lr": 4.799830424969008e-06, "epoch": 0.30977734753146174, "percentage": 15.49, "elapsed_time": "0:29:21", "remaining_time": "2:40:13"}
34
+ {"current_steps": 330, "total_steps": 2066, "loss": 1.2719, "lr": 4.784182344433878e-06, "epoch": 0.3194578896418199, "percentage": 15.97, "elapsed_time": "0:29:56", "remaining_time": "2:37:32"}
35
+ {"current_steps": 340, "total_steps": 2066, "loss": 1.2732, "lr": 4.767972923644377e-06, "epoch": 0.3291384317521781, "percentage": 16.46, "elapsed_time": "0:30:32", "remaining_time": "2:35:01"}
36
+ {"current_steps": 350, "total_steps": 2066, "loss": 1.3289, "lr": 4.751206146083002e-06, "epoch": 0.3388189738625363, "percentage": 16.94, "elapsed_time": "0:31:06", "remaining_time": "2:32:29"}
37
+ {"current_steps": 360, "total_steps": 2066, "loss": 1.2303, "lr": 4.7338861322032724e-06, "epoch": 0.34849951597289447, "percentage": 17.42, "elapsed_time": "0:31:38", "remaining_time": "2:29:56"}
38
+ {"current_steps": 370, "total_steps": 2066, "loss": 1.1788, "lr": 4.716017138417126e-06, "epoch": 0.35818005808325265, "percentage": 17.91, "elapsed_time": "0:32:12", "remaining_time": "2:27:37"}
39
+ {"current_steps": 380, "total_steps": 2066, "loss": 1.2543, "lr": 4.697603556048899e-06, "epoch": 0.36786060019361083, "percentage": 18.39, "elapsed_time": "0:32:44", "remaining_time": "2:25:18"}
40
+ {"current_steps": 390, "total_steps": 2066, "loss": 1.3091, "lr": 4.6786499102561525e-06, "epoch": 0.377541142303969, "percentage": 18.88, "elapsed_time": "0:33:18", "remaining_time": "2:23:07"}
41
+ {"current_steps": 400, "total_steps": 2066, "loss": 1.2693, "lr": 4.659160858917614e-06, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:34:01", "remaining_time": "2:21:44"}
42
+ {"current_steps": 400, "total_steps": 2066, "eval_loss": 1.3101810216903687, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:34:13", "remaining_time": "2:22:34"}
43
+ {"current_steps": 310, "total_steps": 2066, "loss": 1.307, "lr": 4.814913319717238e-06, "epoch": 0.30009680542110356, "percentage": 15.0, "elapsed_time": "0:00:38", "remaining_time": "0:03:35"}
44
+ {"current_steps": 320, "total_steps": 2066, "loss": 1.273, "lr": 4.799830424969008e-06, "epoch": 0.30977734753146174, "percentage": 15.49, "elapsed_time": "0:01:11", "remaining_time": "0:06:29"}
45
+ {"current_steps": 330, "total_steps": 2066, "loss": 1.2719, "lr": 4.784182344433878e-06, "epoch": 0.3194578896418199, "percentage": 15.97, "elapsed_time": "0:01:45", "remaining_time": "0:09:16"}
46
+ {"current_steps": 340, "total_steps": 2066, "loss": 1.2732, "lr": 4.767972923644377e-06, "epoch": 0.3291384317521781, "percentage": 16.46, "elapsed_time": "0:02:20", "remaining_time": "0:11:52"}
47
+ {"current_steps": 350, "total_steps": 2066, "loss": 1.3289, "lr": 4.751206146083002e-06, "epoch": 0.3388189738625363, "percentage": 16.94, "elapsed_time": "0:02:54", "remaining_time": "0:14:14"}
48
+ {"current_steps": 360, "total_steps": 2066, "loss": 1.2303, "lr": 4.7338861322032724e-06, "epoch": 0.34849951597289447, "percentage": 17.42, "elapsed_time": "0:03:26", "remaining_time": "0:16:18"}
49
+ {"current_steps": 370, "total_steps": 2066, "loss": 1.1788, "lr": 4.716017138417126e-06, "epoch": 0.35818005808325265, "percentage": 17.91, "elapsed_time": "0:04:00", "remaining_time": "0:18:22"}
50
+ {"current_steps": 380, "total_steps": 2066, "loss": 1.2543, "lr": 4.697603556048899e-06, "epoch": 0.36786060019361083, "percentage": 18.39, "elapsed_time": "0:04:32", "remaining_time": "0:20:10"}
51
+ {"current_steps": 390, "total_steps": 2066, "loss": 1.3091, "lr": 4.6786499102561525e-06, "epoch": 0.377541142303969, "percentage": 18.88, "elapsed_time": "0:05:05", "remaining_time": "0:21:52"}
52
+ {"current_steps": 400, "total_steps": 2066, "loss": 1.2693, "lr": 4.659160858917614e-06, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:05:48", "remaining_time": "0:24:10"}
53
+ {"current_steps": 400, "total_steps": 2066, "eval_loss": 1.3101810216903687, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:05:59", "remaining_time": "0:24:56"}
54
+ {"current_steps": 410, "total_steps": 2066, "loss": 1.2866, "lr": 4.639141191488498e-06, "epoch": 0.3969022265246854, "percentage": 19.85, "elapsed_time": "0:07:46", "remaining_time": "0:31:25"}
55
+ {"current_steps": 420, "total_steps": 2066, "loss": 1.3088, "lr": 4.618595827823486e-06, "epoch": 0.40658276863504356, "percentage": 20.33, "elapsed_time": "0:08:32", "remaining_time": "0:33:27"}
56
+ {"current_steps": 430, "total_steps": 2066, "loss": 1.2445, "lr": 4.597529816967676e-06, "epoch": 0.41626331074540174, "percentage": 20.81, "elapsed_time": "0:09:04", "remaining_time": "0:34:31"}
57
+ {"current_steps": 440, "total_steps": 2066, "loss": 1.2679, "lr": 4.575948335915769e-06, "epoch": 0.4259438528557599, "percentage": 21.3, "elapsed_time": "0:09:43", "remaining_time": "0:35:58"}
58
+ {"current_steps": 450, "total_steps": 2066, "loss": 1.2699, "lr": 4.553856688339817e-06, "epoch": 0.4356243949661181, "percentage": 21.78, "elapsed_time": "0:10:24", "remaining_time": "0:37:22"}
59
+ {"current_steps": 460, "total_steps": 2066, "loss": 1.2381, "lr": 4.531260303285841e-06, "epoch": 0.4453049370764763, "percentage": 22.27, "elapsed_time": "0:10:59", "remaining_time": "0:38:23"}
60
+ {"current_steps": 470, "total_steps": 2066, "loss": 1.3089, "lr": 4.50816473383964e-06, "epoch": 0.45498547918683446, "percentage": 22.75, "elapsed_time": "0:11:49", "remaining_time": "0:40:10"}
61
+ {"current_steps": 480, "total_steps": 2066, "loss": 1.2271, "lr": 4.484575655762107e-06, "epoch": 0.46466602129719264, "percentage": 23.23, "elapsed_time": "0:12:21", "remaining_time": "0:40:50"}
62
+ {"current_steps": 490, "total_steps": 2066, "loss": 1.2136, "lr": 4.460498866094412e-06, "epoch": 0.4743465634075508, "percentage": 23.72, "elapsed_time": "0:12:57", "remaining_time": "0:41:39"}
63
+ {"current_steps": 500, "total_steps": 2066, "loss": 1.2747, "lr": 4.435940281733369e-06, "epoch": 0.484027105517909, "percentage": 24.2, "elapsed_time": "0:13:30", "remaining_time": "0:42:17"}
64
+ {"current_steps": 510, "total_steps": 2066, "loss": 1.265, "lr": 4.410905937977353e-06, "epoch": 0.4937076476282672, "percentage": 24.69, "elapsed_time": "0:15:07", "remaining_time": "0:46:10"}
65
+ {"current_steps": 520, "total_steps": 2066, "loss": 1.2895, "lr": 4.385401987043118e-06, "epoch": 0.5033881897386253, "percentage": 25.17, "elapsed_time": "0:15:39", "remaining_time": "0:46:34"}
66
+ {"current_steps": 530, "total_steps": 2066, "loss": 1.2376, "lr": 4.359434696553889e-06, "epoch": 0.5130687318489835, "percentage": 25.65, "elapsed_time": "0:16:14", "remaining_time": "0:47:03"}
67
+ {"current_steps": 540, "total_steps": 2066, "loss": 1.2575, "lr": 4.333010447999077e-06, "epoch": 0.5227492739593417, "percentage": 26.14, "elapsed_time": "0:16:48", "remaining_time": "0:47:28"}
68
+ {"current_steps": 550, "total_steps": 2066, "loss": 1.267, "lr": 4.3061357351660285e-06, "epoch": 0.5324298160696999, "percentage": 26.62, "elapsed_time": "0:17:28", "remaining_time": "0:48:10"}
69
+ {"current_steps": 560, "total_steps": 2066, "loss": 1.2584, "lr": 4.27881716254417e-06, "epoch": 0.542110358180058, "percentage": 27.11, "elapsed_time": "0:18:04", "remaining_time": "0:48:36"}
70
+ {"current_steps": 570, "total_steps": 2066, "loss": 1.2263, "lr": 4.251061443701941e-06, "epoch": 0.5517909002904162, "percentage": 27.59, "elapsed_time": "0:18:42", "remaining_time": "0:49:07"}
71
+ {"current_steps": 580, "total_steps": 2066, "loss": 1.2231, "lr": 4.222875399636938e-06, "epoch": 0.5614714424007744, "percentage": 28.07, "elapsed_time": "0:19:14", "remaining_time": "0:49:18"}
72
+ {"current_steps": 590, "total_steps": 2066, "loss": 1.2656, "lr": 4.194265957099638e-06, "epoch": 0.5711519845111326, "percentage": 28.56, "elapsed_time": "0:19:47", "remaining_time": "0:49:29"}
73
+ {"current_steps": 600, "total_steps": 2066, "loss": 1.2341, "lr": 4.165240146891145e-06, "epoch": 0.5808325266214908, "percentage": 29.04, "elapsed_time": "0:20:20", "remaining_time": "0:49:41"}
74
+ {"current_steps": 600, "total_steps": 2066, "eval_loss": 1.3036646842956543, "epoch": 0.5808325266214908, "percentage": 29.04, "elapsed_time": "0:20:31", "remaining_time": "0:50:08"}
75
+ {"current_steps": 610, "total_steps": 2066, "loss": 1.2413, "lr": 4.1358051021353655e-06, "epoch": 0.590513068731849, "percentage": 29.53, "elapsed_time": "0:22:10", "remaining_time": "0:52:55"}
76
+ {"current_steps": 620, "total_steps": 2066, "loss": 1.2342, "lr": 4.1059680565260315e-06, "epoch": 0.6001936108422071, "percentage": 30.01, "elapsed_time": "0:22:53", "remaining_time": "0:53:22"}
77
+ {"current_steps": 630, "total_steps": 2066, "loss": 1.1899, "lr": 4.0757363425490185e-06, "epoch": 0.6098741529525653, "percentage": 30.49, "elapsed_time": "0:23:28", "remaining_time": "0:53:30"}
78
+ {"current_steps": 640, "total_steps": 2066, "loss": 1.1912, "lr": 4.04511738968037e-06, "epoch": 0.6195546950629235, "percentage": 30.98, "elapsed_time": "0:24:12", "remaining_time": "0:53:57"}
79
+ {"current_steps": 650, "total_steps": 2066, "loss": 1.2066, "lr": 4.0141187225605064e-06, "epoch": 0.6292352371732817, "percentage": 31.46, "elapsed_time": "0:24:50", "remaining_time": "0:54:07"}
80
+ {"current_steps": 660, "total_steps": 2066, "loss": 1.2394, "lr": 3.98274795914503e-06, "epoch": 0.6389157792836399, "percentage": 31.95, "elapsed_time": "0:25:22", "remaining_time": "0:54:03"}
81
+ {"current_steps": 670, "total_steps": 2066, "loss": 1.2069, "lr": 3.951012808832603e-06, "epoch": 0.648596321393998, "percentage": 32.43, "elapsed_time": "0:25:57", "remaining_time": "0:54:04"}
82
+ {"current_steps": 680, "total_steps": 2066, "loss": 1.2724, "lr": 3.918921070570361e-06, "epoch": 0.6582768635043562, "percentage": 32.91, "elapsed_time": "0:26:31", "remaining_time": "0:54:03"}
83
+ {"current_steps": 690, "total_steps": 2066, "loss": 1.3105, "lr": 3.886480630937307e-06, "epoch": 0.6679574056147144, "percentage": 33.4, "elapsed_time": "0:27:02", "remaining_time": "0:53:55"}
84
+ {"current_steps": 700, "total_steps": 2066, "loss": 1.1989, "lr": 3.853699462206183e-06, "epoch": 0.6776379477250726, "percentage": 33.88, "elapsed_time": "0:27:35", "remaining_time": "0:53:50"}
85
+ {"current_steps": 710, "total_steps": 2066, "loss": 1.3256, "lr": 3.820585620384265e-06, "epoch": 0.6873184898354308, "percentage": 34.37, "elapsed_time": "0:29:14", "remaining_time": "0:55:50"}
86
+ {"current_steps": 720, "total_steps": 2066, "loss": 1.2206, "lr": 3.787147243233602e-06, "epoch": 0.6969990319457889, "percentage": 34.85, "elapsed_time": "0:29:46", "remaining_time": "0:55:39"}
87
+ {"current_steps": 730, "total_steps": 2066, "loss": 1.2245, "lr": 3.753392548271144e-06, "epoch": 0.7066795740561471, "percentage": 35.33, "elapsed_time": "0:30:18", "remaining_time": "0:55:28"}
88
+ {"current_steps": 740, "total_steps": 2066, "loss": 1.2685, "lr": 3.7193298307492855e-06, "epoch": 0.7163601161665053, "percentage": 35.82, "elapsed_time": "0:30:56", "remaining_time": "0:55:26"}
89
+ {"current_steps": 750, "total_steps": 2066, "loss": 1.2379, "lr": 3.6849674616172887e-06, "epoch": 0.7260406582768635, "percentage": 36.3, "elapsed_time": "0:31:28", "remaining_time": "0:55:12"}
90
+ {"current_steps": 760, "total_steps": 2066, "loss": 1.2176, "lr": 3.6503138854641257e-06, "epoch": 0.7357212003872217, "percentage": 36.79, "elapsed_time": "0:32:00", "remaining_time": "0:54:59"}
91
+ {"current_steps": 770, "total_steps": 2066, "loss": 1.2751, "lr": 3.615377618443201e-06, "epoch": 0.7454017424975798, "percentage": 37.27, "elapsed_time": "0:32:33", "remaining_time": "0:54:47"}
92
+ {"current_steps": 780, "total_steps": 2066, "loss": 1.2335, "lr": 3.5801672461795032e-06, "epoch": 0.755082284607938, "percentage": 37.75, "elapsed_time": "0:33:05", "remaining_time": "0:54:32"}
93
+ {"current_steps": 790, "total_steps": 2066, "loss": 1.2816, "lr": 3.5446914216596805e-06, "epoch": 0.7647628267182962, "percentage": 38.24, "elapsed_time": "0:33:36", "remaining_time": "0:54:17"}
94
+ {"current_steps": 800, "total_steps": 2066, "loss": 1.1997, "lr": 3.5089588631055527e-06, "epoch": 0.7744433688286544, "percentage": 38.72, "elapsed_time": "0:34:08", "remaining_time": "0:54:01"}
95
+ {"current_steps": 800, "total_steps": 2066, "eval_loss": 1.2973461151123047, "epoch": 0.7744433688286544, "percentage": 38.72, "elapsed_time": "0:34:19", "remaining_time": "0:54:19"}
96
+ {"current_steps": 810, "total_steps": 2066, "loss": 1.2153, "lr": 3.472978351831606e-06, "epoch": 0.7841239109390126, "percentage": 39.21, "elapsed_time": "0:35:55", "remaining_time": "0:55:43"}
97
+ {"current_steps": 820, "total_steps": 2066, "loss": 1.1981, "lr": 3.436758730086971e-06, "epoch": 0.7938044530493708, "percentage": 39.69, "elapsed_time": "0:36:31", "remaining_time": "0:55:30"}
98
+ {"current_steps": 830, "total_steps": 2066, "loss": 1.2271, "lr": 3.4003088988824323e-06, "epoch": 0.8034849951597289, "percentage": 40.17, "elapsed_time": "0:37:02", "remaining_time": "0:55:09"}
99
+ {"current_steps": 840, "total_steps": 2066, "loss": 1.2394, "lr": 3.363637815802998e-06, "epoch": 0.8131655372700871, "percentage": 40.66, "elapsed_time": "0:37:34", "remaining_time": "0:54:50"}
100
+ {"current_steps": 850, "total_steps": 2066, "loss": 1.2334, "lr": 3.326754492806559e-06, "epoch": 0.8228460793804453, "percentage": 41.14, "elapsed_time": "0:38:06", "remaining_time": "0:54:31"}
101
+ {"current_steps": 860, "total_steps": 2066, "loss": 1.2327, "lr": 3.2896679940091913e-06, "epoch": 0.8325266214908035, "percentage": 41.63, "elapsed_time": "0:38:39", "remaining_time": "0:54:13"}
102
+ {"current_steps": 870, "total_steps": 2066, "loss": 1.2282, "lr": 3.2523874334576456e-06, "epoch": 0.8422071636011617, "percentage": 42.11, "elapsed_time": "0:39:11", "remaining_time": "0:53:52"}
103
+ {"current_steps": 880, "total_steps": 2066, "loss": 1.2132, "lr": 3.214921972889552e-06, "epoch": 0.8518877057115198, "percentage": 42.59, "elapsed_time": "0:39:42", "remaining_time": "0:53:30"}
104
+ {"current_steps": 890, "total_steps": 2066, "loss": 1.2473, "lr": 3.17728081948192e-06, "epoch": 0.861568247821878, "percentage": 43.08, "elapsed_time": "0:40:17", "remaining_time": "0:53:14"}
105
+ {"current_steps": 900, "total_steps": 2066, "loss": 1.2524, "lr": 3.139473223588462e-06, "epoch": 0.8712487899322362, "percentage": 43.56, "elapsed_time": "0:40:56", "remaining_time": "0:53:01"}
106
+ {"current_steps": 910, "total_steps": 2066, "loss": 1.2423, "lr": 3.1015084764663074e-06, "epoch": 0.8809293320425944, "percentage": 44.05, "elapsed_time": "0:42:33", "remaining_time": "0:54:03"}
107
+ {"current_steps": 920, "total_steps": 2066, "loss": 1.1997, "lr": 3.063395907992671e-06, "epoch": 0.8906098741529526, "percentage": 44.53, "elapsed_time": "0:43:20", "remaining_time": "0:53:59"}
108
+ {"current_steps": 930, "total_steps": 2066, "loss": 1.2233, "lr": 3.025144884372021e-06, "epoch": 0.9002904162633107, "percentage": 45.01, "elapsed_time": "0:43:58", "remaining_time": "0:53:43"}
109
+ {"current_steps": 940, "total_steps": 2066, "loss": 1.2115, "lr": 2.9867648058343262e-06, "epoch": 0.9099709583736689, "percentage": 45.5, "elapsed_time": "0:44:33", "remaining_time": "0:53:22"}
110
+ {"current_steps": 950, "total_steps": 2066, "loss": 1.2139, "lr": 2.948265104324941e-06, "epoch": 0.9196515004840271, "percentage": 45.98, "elapsed_time": "0:45:08", "remaining_time": "0:53:01"}
111
+ {"current_steps": 960, "total_steps": 2066, "loss": 1.2201, "lr": 2.9096552411866903e-06, "epoch": 0.9293320425943853, "percentage": 46.47, "elapsed_time": "0:45:41", "remaining_time": "0:52:38"}
112
+ {"current_steps": 970, "total_steps": 2066, "loss": 1.1997, "lr": 2.8709447048347394e-06, "epoch": 0.9390125847047435, "percentage": 46.95, "elapsed_time": "0:46:17", "remaining_time": "0:52:18"}
113
+ {"current_steps": 980, "total_steps": 2066, "loss": 1.2363, "lr": 2.832143008424802e-06, "epoch": 0.9486931268151017, "percentage": 47.43, "elapsed_time": "0:46:52", "remaining_time": "0:51:56"}
114
+ {"current_steps": 990, "total_steps": 2066, "loss": 1.2573, "lr": 2.7932596875152747e-06, "epoch": 0.9583736689254598, "percentage": 47.92, "elapsed_time": "0:47:47", "remaining_time": "0:51:56"}
115
+ {"current_steps": 1000, "total_steps": 2066, "loss": 1.2403, "lr": 2.754304297723862e-06, "epoch": 0.968054211035818, "percentage": 48.4, "elapsed_time": "0:48:19", "remaining_time": "0:51:30"}
116
+ {"current_steps": 1000, "total_steps": 2066, "eval_loss": 1.2926961183547974, "epoch": 0.968054211035818, "percentage": 48.4, "elapsed_time": "0:48:31", "remaining_time": "0:51:43"}
117
+ {"current_steps": 1010, "total_steps": 2066, "loss": 1.2915, "lr": 2.7152864123792716e-06, "epoch": 0.9777347531461762, "percentage": 48.89, "elapsed_time": "0:50:11", "remaining_time": "0:52:28"}
118
+ {"current_steps": 1020, "total_steps": 2066, "loss": 1.2246, "lr": 2.6762156201685627e-06, "epoch": 0.9874152952565344, "percentage": 49.37, "elapsed_time": "0:50:42", "remaining_time": "0:52:00"}
119
+ {"current_steps": 1030, "total_steps": 2066, "loss": 1.302, "lr": 2.6371015227807127e-06, "epoch": 0.9970958373668926, "percentage": 49.85, "elapsed_time": "0:51:16", "remaining_time": "0:51:34"}
120
+ {"current_steps": 1040, "total_steps": 2066, "loss": 1.1438, "lr": 2.5979537325469913e-06, "epoch": 1.0067763794772506, "percentage": 50.34, "elapsed_time": "0:51:49", "remaining_time": "0:51:07"}
121
+ {"current_steps": 1050, "total_steps": 2066, "loss": 0.9893, "lr": 2.558781870078722e-06, "epoch": 1.016456921587609, "percentage": 50.82, "elapsed_time": "0:52:21", "remaining_time": "0:50:39"}
122
+ {"current_steps": 1060, "total_steps": 2066, "loss": 0.9725, "lr": 2.5195955619030064e-06, "epoch": 1.026137463697967, "percentage": 51.31, "elapsed_time": "0:52:57", "remaining_time": "0:50:15"}
123
+ {"current_steps": 1070, "total_steps": 2066, "loss": 0.9776, "lr": 2.480404438096994e-06, "epoch": 1.0358180058083253, "percentage": 51.79, "elapsed_time": "0:53:30", "remaining_time": "0:49:48"}
124
+ {"current_steps": 1080, "total_steps": 2066, "loss": 1.0161, "lr": 2.441218129921278e-06, "epoch": 1.0454985479186834, "percentage": 52.27, "elapsed_time": "0:54:08", "remaining_time": "0:49:26"}
125
+ {"current_steps": 1090, "total_steps": 2066, "loss": 1.0164, "lr": 2.402046267453009e-06, "epoch": 1.0551790900290416, "percentage": 52.76, "elapsed_time": "0:54:46", "remaining_time": "0:49:03"}
126
+ {"current_steps": 1100, "total_steps": 2066, "loss": 0.9799, "lr": 2.3628984772192885e-06, "epoch": 1.0648596321393997, "percentage": 53.24, "elapsed_time": "0:55:18", "remaining_time": "0:48:33"}
127
+ {"current_steps": 1110, "total_steps": 2066, "loss": 0.9829, "lr": 2.323784379831438e-06, "epoch": 1.074540174249758, "percentage": 53.73, "elapsed_time": "0:56:56", "remaining_time": "0:49:02"}
128
+ {"current_steps": 1120, "total_steps": 2066, "loss": 0.9397, "lr": 2.2847135876207292e-06, "epoch": 1.084220716360116, "percentage": 54.21, "elapsed_time": "0:57:31", "remaining_time": "0:48:35"}
129
+ {"current_steps": 1130, "total_steps": 2066, "loss": 0.9544, "lr": 2.245695702276139e-06, "epoch": 1.0939012584704744, "percentage": 54.7, "elapsed_time": "0:58:12", "remaining_time": "0:48:12"}
130
+ {"current_steps": 1140, "total_steps": 2066, "loss": 0.9867, "lr": 2.2067403124847257e-06, "epoch": 1.1035818005808324, "percentage": 55.18, "elapsed_time": "0:58:44", "remaining_time": "0:47:42"}
131
+ {"current_steps": 1150, "total_steps": 2066, "loss": 0.9843, "lr": 2.167856991575199e-06, "epoch": 1.1132623426911907, "percentage": 55.66, "elapsed_time": "0:59:15", "remaining_time": "0:47:12"}
132
+ {"current_steps": 1160, "total_steps": 2066, "loss": 0.9621, "lr": 2.1290552951652614e-06, "epoch": 1.1229428848015488, "percentage": 56.15, "elapsed_time": "0:59:50", "remaining_time": "0:46:44"}
133
+ {"current_steps": 1170, "total_steps": 2066, "loss": 1.0003, "lr": 2.09034475881331e-06, "epoch": 1.132623426911907, "percentage": 56.63, "elapsed_time": "1:00:26", "remaining_time": "0:46:17"}
134
+ {"current_steps": 1180, "total_steps": 2066, "loss": 0.9598, "lr": 2.0517348956750597e-06, "epoch": 1.1423039690222652, "percentage": 57.12, "elapsed_time": "1:01:03", "remaining_time": "0:45:51"}
135
+ {"current_steps": 1190, "total_steps": 2066, "loss": 0.9328, "lr": 2.0132351941656737e-06, "epoch": 1.1519845111326235, "percentage": 57.6, "elapsed_time": "1:01:38", "remaining_time": "0:45:22"}
136
+ {"current_steps": 1200, "total_steps": 2066, "loss": 0.9994, "lr": 1.9748551156279803e-06, "epoch": 1.1616650532429815, "percentage": 58.08, "elapsed_time": "1:02:10", "remaining_time": "0:44:52"}
137
+ {"current_steps": 1200, "total_steps": 2066, "eval_loss": 1.3311784267425537, "epoch": 1.1616650532429815, "percentage": 58.08, "elapsed_time": "1:02:21", "remaining_time": "0:45:00"}
138
+ {"current_steps": 1210, "total_steps": 2066, "loss": 1.0021, "lr": 1.93660409200733e-06, "epoch": 1.1713455953533398, "percentage": 58.57, "elapsed_time": "1:03:58", "remaining_time": "0:45:15"}
139
+ {"current_steps": 1220, "total_steps": 2066, "loss": 1.0096, "lr": 1.8984915235336934e-06, "epoch": 1.181026137463698, "percentage": 59.05, "elapsed_time": "1:04:31", "remaining_time": "0:44:44"}
140
+ {"current_steps": 1230, "total_steps": 2066, "loss": 1.0096, "lr": 1.860526776411539e-06, "epoch": 1.1907066795740562, "percentage": 59.54, "elapsed_time": "1:05:11", "remaining_time": "0:44:18"}
141
+ {"current_steps": 1240, "total_steps": 2066, "loss": 1.0768, "lr": 1.8227191805180806e-06, "epoch": 1.2003872216844143, "percentage": 60.02, "elapsed_time": "1:05:44", "remaining_time": "0:43:47"}
142
+ {"current_steps": 1250, "total_steps": 2066, "loss": 1.0072, "lr": 1.7850780271104483e-06, "epoch": 1.2100677637947725, "percentage": 60.5, "elapsed_time": "1:06:20", "remaining_time": "0:43:18"}
143
+ {"current_steps": 1260, "total_steps": 2066, "loss": 0.9954, "lr": 1.747612566542356e-06, "epoch": 1.2197483059051306, "percentage": 60.99, "elapsed_time": "1:06:51", "remaining_time": "0:42:45"}
144
+ {"current_steps": 1270, "total_steps": 2066, "loss": 0.9856, "lr": 1.7103320059908093e-06, "epoch": 1.229428848015489, "percentage": 61.47, "elapsed_time": "1:07:23", "remaining_time": "0:42:14"}
145
+ {"current_steps": 1280, "total_steps": 2066, "loss": 0.9882, "lr": 1.6732455071934424e-06, "epoch": 1.239109390125847, "percentage": 61.96, "elapsed_time": "1:07:55", "remaining_time": "0:41:42"}
146
+ {"current_steps": 1290, "total_steps": 2066, "loss": 0.9218, "lr": 1.6363621841970022e-06, "epoch": 1.2487899322362053, "percentage": 62.44, "elapsed_time": "1:08:36", "remaining_time": "0:41:16"}
147
+ {"current_steps": 1300, "total_steps": 2066, "loss": 0.9803, "lr": 1.5996911011175675e-06, "epoch": 1.2584704743465633, "percentage": 62.92, "elapsed_time": "1:09:08", "remaining_time": "0:40:44"}
148
+ {"current_steps": 1310, "total_steps": 2066, "loss": 0.9732, "lr": 1.5632412699130306e-06, "epoch": 1.2681510164569216, "percentage": 63.41, "elapsed_time": "1:10:43", "remaining_time": "0:40:48"}
149
+ {"current_steps": 1320, "total_steps": 2066, "loss": 0.9656, "lr": 1.5270216481683954e-06, "epoch": 1.2778315585672797, "percentage": 63.89, "elapsed_time": "1:11:17", "remaining_time": "0:40:17"}
150
+ {"current_steps": 1330, "total_steps": 2066, "loss": 0.968, "lr": 1.4910411368944483e-06, "epoch": 1.287512100677638, "percentage": 64.38, "elapsed_time": "1:11:53", "remaining_time": "0:39:46"}
151
+ {"current_steps": 1340, "total_steps": 2066, "loss": 1.0192, "lr": 1.4553085783403201e-06, "epoch": 1.297192642787996, "percentage": 64.86, "elapsed_time": "1:12:24", "remaining_time": "0:39:13"}
152
+ {"current_steps": 1350, "total_steps": 2066, "loss": 0.9862, "lr": 1.419832753820496e-06, "epoch": 1.3068731848983544, "percentage": 65.34, "elapsed_time": "1:13:02", "remaining_time": "0:38:44"}
153
+ {"current_steps": 1360, "total_steps": 2066, "loss": 0.9557, "lr": 1.3846223815568005e-06, "epoch": 1.3165537270087124, "percentage": 65.83, "elapsed_time": "1:13:35", "remaining_time": "0:38:12"}
154
+ {"current_steps": 1370, "total_steps": 2066, "loss": 0.9833, "lr": 1.349686114535875e-06, "epoch": 1.3262342691190707, "percentage": 66.31, "elapsed_time": "1:14:11", "remaining_time": "0:37:41"}
155
+ {"current_steps": 1380, "total_steps": 2066, "loss": 1.0331, "lr": 1.3150325383827117e-06, "epoch": 1.3359148112294288, "percentage": 66.8, "elapsed_time": "1:14:43", "remaining_time": "0:37:08"}
156
+ {"current_steps": 1390, "total_steps": 2066, "loss": 1.0069, "lr": 1.2806701692507162e-06, "epoch": 1.345595353339787, "percentage": 67.28, "elapsed_time": "1:15:16", "remaining_time": "0:36:36"}
157
+ {"current_steps": 1400, "total_steps": 2066, "loss": 0.9531, "lr": 1.2466074517288558e-06, "epoch": 1.3552758954501452, "percentage": 67.76, "elapsed_time": "1:15:49", "remaining_time": "0:36:04"}
158
+ {"current_steps": 1400, "total_steps": 2066, "eval_loss": 1.3317842483520508, "epoch": 1.3552758954501452, "percentage": 67.76, "elapsed_time": "1:16:00", "remaining_time": "0:36:09"}
159
+ {"current_steps": 1410, "total_steps": 2066, "loss": 0.991, "lr": 1.212852756766399e-06, "epoch": 1.3649564375605034, "percentage": 68.25, "elapsed_time": "1:17:41", "remaining_time": "0:36:08"}
160
+ {"current_steps": 1420, "total_steps": 2066, "loss": 0.957, "lr": 1.1794143796157358e-06, "epoch": 1.3746369796708615, "percentage": 68.73, "elapsed_time": "1:18:14", "remaining_time": "0:35:35"}
161
+ {"current_steps": 1430, "total_steps": 2066, "loss": 0.9406, "lr": 1.1463005377938182e-06, "epoch": 1.3843175217812198, "percentage": 69.22, "elapsed_time": "1:18:50", "remaining_time": "0:35:04"}
162
+ {"current_steps": 1440, "total_steps": 2066, "loss": 0.958, "lr": 1.1135193690626926e-06, "epoch": 1.3939980638915779, "percentage": 69.7, "elapsed_time": "1:19:27", "remaining_time": "0:34:32"}
163
+ {"current_steps": 1450, "total_steps": 2066, "loss": 1.0262, "lr": 1.0810789294296397e-06, "epoch": 1.4036786060019362, "percentage": 70.18, "elapsed_time": "1:20:02", "remaining_time": "0:34:00"}
164
+ {"current_steps": 1460, "total_steps": 2066, "loss": 0.9745, "lr": 1.048987191167398e-06, "epoch": 1.4133591481122942, "percentage": 70.67, "elapsed_time": "1:20:35", "remaining_time": "0:33:26"}
165
+ {"current_steps": 1470, "total_steps": 2066, "loss": 0.9759, "lr": 1.0172520408549716e-06, "epoch": 1.4230396902226525, "percentage": 71.15, "elapsed_time": "1:21:11", "remaining_time": "0:32:54"}
166
+ {"current_steps": 1480, "total_steps": 2066, "loss": 1.0117, "lr": 9.858812774394946e-07, "epoch": 1.4327202323330106, "percentage": 71.64, "elapsed_time": "1:21:43", "remaining_time": "0:32:21"}
167
+ {"current_steps": 1490, "total_steps": 2066, "loss": 0.9736, "lr": 9.548826103196304e-07, "epoch": 1.442400774443369, "percentage": 72.12, "elapsed_time": "1:22:16", "remaining_time": "0:31:48"}
168
+ {"current_steps": 1500, "total_steps": 2066, "loss": 1.002, "lr": 9.242636574509828e-07, "epoch": 1.452081316553727, "percentage": 72.6, "elapsed_time": "1:22:50", "remaining_time": "0:31:15"}
169
+ {"current_steps": 1510, "total_steps": 2066, "loss": 1.0391, "lr": 8.940319434739683e-07, "epoch": 1.4617618586640853, "percentage": 73.09, "elapsed_time": "1:24:32", "remaining_time": "0:31:07"}
170
+ {"current_steps": 1520, "total_steps": 2066, "loss": 0.9864, "lr": 8.641948978646361e-07, "epoch": 1.4714424007744433, "percentage": 73.57, "elapsed_time": "1:25:10", "remaining_time": "0:30:35"}
171
+ {"current_steps": 1530, "total_steps": 2066, "loss": 1.0425, "lr": 8.347598531088555e-07, "epoch": 1.4811229428848016, "percentage": 74.06, "elapsed_time": "1:25:45", "remaining_time": "0:30:02"}
172
+ {"current_steps": 1540, "total_steps": 2066, "loss": 1.0028, "lr": 8.05734042900363e-07, "epoch": 1.4908034849951597, "percentage": 74.54, "elapsed_time": "1:26:18", "remaining_time": "0:29:28"}
173
+ {"current_steps": 1550, "total_steps": 2066, "loss": 0.9764, "lr": 7.771246003630625e-07, "epoch": 1.500484027105518, "percentage": 75.02, "elapsed_time": "1:26:50", "remaining_time": "0:28:54"}
174
+ {"current_steps": 1560, "total_steps": 2066, "loss": 0.9658, "lr": 7.489385562980589e-07, "epoch": 1.510164569215876, "percentage": 75.51, "elapsed_time": "1:27:25", "remaining_time": "0:28:21"}
175
+ {"current_steps": 1570, "total_steps": 2066, "loss": 0.9621, "lr": 7.211828374558311e-07, "epoch": 1.5198451113262341, "percentage": 75.99, "elapsed_time": "1:27:56", "remaining_time": "0:27:47"}
176
+ {"current_steps": 1580, "total_steps": 2066, "loss": 0.9874, "lr": 6.938642648339719e-07, "epoch": 1.5295256534365924, "percentage": 76.48, "elapsed_time": "1:28:29", "remaining_time": "0:27:13"}
177
+ {"current_steps": 1590, "total_steps": 2066, "loss": 0.9481, "lr": 6.669895520009239e-07, "epoch": 1.5392061955469507, "percentage": 76.96, "elapsed_time": "1:29:02", "remaining_time": "0:26:39"}
178
+ {"current_steps": 1600, "total_steps": 2066, "loss": 0.9555, "lr": 6.405653034461115e-07, "epoch": 1.5488867376573088, "percentage": 77.44, "elapsed_time": "1:29:36", "remaining_time": "0:26:06"}
179
+ {"current_steps": 1600, "total_steps": 2066, "eval_loss": 1.3306459188461304, "epoch": 1.5488867376573088, "percentage": 77.44, "elapsed_time": "1:29:47", "remaining_time": "0:26:09"}
180
+ {"current_steps": 1610, "total_steps": 2066, "loss": 1.0002, "lr": 6.145980129568823e-07, "epoch": 1.5585672797676668, "percentage": 77.93, "elapsed_time": "1:31:24", "remaining_time": "0:25:53"}
181
+ {"current_steps": 1620, "total_steps": 2066, "loss": 1.0028, "lr": 5.890940620226479e-07, "epoch": 1.5682478218780251, "percentage": 78.41, "elapsed_time": "1:31:57", "remaining_time": "0:25:19"}
182
+ {"current_steps": 1630, "total_steps": 2066, "loss": 0.9734, "lr": 5.640597182666324e-07, "epoch": 1.5779283639883834, "percentage": 78.9, "elapsed_time": "1:32:29", "remaining_time": "0:24:44"}
183
+ {"current_steps": 1640, "total_steps": 2066, "loss": 0.976, "lr": 5.395011339055886e-07, "epoch": 1.5876089060987415, "percentage": 79.38, "elapsed_time": "1:33:05", "remaining_time": "0:24:10"}
184
+ {"current_steps": 1650, "total_steps": 2066, "loss": 0.9662, "lr": 5.154243442378934e-07, "epoch": 1.5972894482090996, "percentage": 79.86, "elapsed_time": "1:33:38", "remaining_time": "0:23:36"}
185
+ {"current_steps": 1660, "total_steps": 2066, "loss": 1.0096, "lr": 4.918352661603604e-07, "epoch": 1.6069699903194579, "percentage": 80.35, "elapsed_time": "1:34:15", "remaining_time": "0:23:03"}
186
+ {"current_steps": 1670, "total_steps": 2066, "loss": 0.9944, "lr": 4.687396967141583e-07, "epoch": 1.6166505324298162, "percentage": 80.83, "elapsed_time": "1:34:52", "remaining_time": "0:22:29"}
187
+ {"current_steps": 1680, "total_steps": 2066, "loss": 0.9976, "lr": 4.4614331166018403e-07, "epoch": 1.6263310745401742, "percentage": 81.32, "elapsed_time": "1:35:28", "remaining_time": "0:21:56"}
188
+ {"current_steps": 1690, "total_steps": 2066, "loss": 0.9404, "lr": 4.2405166408423154e-07, "epoch": 1.6360116166505323, "percentage": 81.8, "elapsed_time": "1:36:00", "remaining_time": "0:21:21"}
189
+ {"current_steps": 1700, "total_steps": 2066, "loss": 0.9871, "lr": 4.0247018303232437e-07, "epoch": 1.6456921587608906, "percentage": 82.28, "elapsed_time": "1:36:35", "remaining_time": "0:20:47"}
190
+ {"current_steps": 1710, "total_steps": 2066, "loss": 0.9848, "lr": 3.8140417217651437e-07, "epoch": 1.6553727008712489, "percentage": 82.77, "elapsed_time": "1:38:17", "remaining_time": "0:20:27"}
191
+ {"current_steps": 1720, "total_steps": 2066, "loss": 1.0306, "lr": 3.608588085115028e-07, "epoch": 1.665053242981607, "percentage": 83.25, "elapsed_time": "1:39:07", "remaining_time": "0:19:56"}
192
+ {"current_steps": 1730, "total_steps": 2066, "loss": 0.9356, "lr": 3.408391410823864e-07, "epoch": 1.674733785091965, "percentage": 83.74, "elapsed_time": "1:39:51", "remaining_time": "0:19:23"}
193
+ {"current_steps": 1740, "total_steps": 2066, "loss": 0.9694, "lr": 3.213500897438487e-07, "epoch": 1.6844143272023233, "percentage": 84.22, "elapsed_time": "1:40:24", "remaining_time": "0:18:48"}
194
+ {"current_steps": 1750, "total_steps": 2066, "loss": 0.9816, "lr": 3.023964439511026e-07, "epoch": 1.6940948693126816, "percentage": 84.7, "elapsed_time": "1:41:02", "remaining_time": "0:18:14"}
195
+ {"current_steps": 1760, "total_steps": 2066, "loss": 1.0213, "lr": 2.839828615828744e-07, "epoch": 1.7037754114230397, "percentage": 85.19, "elapsed_time": "1:41:42", "remaining_time": "0:17:41"}
196
+ {"current_steps": 1770, "total_steps": 2066, "loss": 0.9737, "lr": 2.6611386779672786e-07, "epoch": 1.7134559535333977, "percentage": 85.67, "elapsed_time": "1:42:28", "remaining_time": "0:17:08"}
197
+ {"current_steps": 1780, "total_steps": 2066, "loss": 0.9567, "lr": 2.487938539169982e-07, "epoch": 1.723136495643756, "percentage": 86.16, "elapsed_time": "1:43:02", "remaining_time": "0:16:33"}
198
+ {"current_steps": 1790, "total_steps": 2066, "loss": 0.9943, "lr": 2.3202707635562371e-07, "epoch": 1.7328170377541143, "percentage": 86.64, "elapsed_time": "1:43:39", "remaining_time": "0:15:58"}
199
+ {"current_steps": 1800, "total_steps": 2066, "loss": 0.9866, "lr": 2.1581765556612233e-07, "epoch": 1.7424975798644724, "percentage": 87.12, "elapsed_time": "1:44:12", "remaining_time": "0:15:24"}
200
+ {"current_steps": 1800, "total_steps": 2066, "eval_loss": 1.3297312259674072, "epoch": 1.7424975798644724, "percentage": 87.12, "elapsed_time": "1:44:25", "remaining_time": "0:15:25"}
201
+ {"current_steps": 1810, "total_steps": 2066, "loss": 0.9948, "lr": 2.001695750309926e-07, "epoch": 1.7521781219748305, "percentage": 87.61, "elapsed_time": "1:46:02", "remaining_time": "0:14:59"}
202
+ {"current_steps": 1820, "total_steps": 2066, "loss": 0.9888, "lr": 1.8508668028276305e-07, "epoch": 1.7618586640851888, "percentage": 88.09, "elapsed_time": "1:46:37", "remaining_time": "0:14:24"}
203
+ {"current_steps": 1830, "total_steps": 2066, "loss": 1.0625, "lr": 1.7057267795895117e-07, "epoch": 1.771539206195547, "percentage": 88.58, "elapsed_time": "1:47:14", "remaining_time": "0:13:49"}
204
+ {"current_steps": 1840, "total_steps": 2066, "loss": 0.9834, "lr": 1.566311348911534e-07, "epoch": 1.7812197483059051, "percentage": 89.06, "elapsed_time": "1:47:47", "remaining_time": "0:13:14"}
205
+ {"current_steps": 1850, "total_steps": 2066, "loss": 0.9507, "lr": 1.4326547722848972e-07, "epoch": 1.7909002904162632, "percentage": 89.55, "elapsed_time": "1:48:19", "remaining_time": "0:12:38"}
206
+ {"current_steps": 1860, "total_steps": 2066, "loss": 0.9997, "lr": 1.3047898959562767e-07, "epoch": 1.8005808325266215, "percentage": 90.03, "elapsed_time": "1:48:56", "remaining_time": "0:12:03"}
207
+ {"current_steps": 1870, "total_steps": 2066, "loss": 0.9919, "lr": 1.1827481428557969e-07, "epoch": 1.8102613746369798, "percentage": 90.51, "elapsed_time": "1:49:47", "remaining_time": "0:11:30"}
208
+ {"current_steps": 1880, "total_steps": 2066, "loss": 0.9774, "lr": 1.0665595048748257e-07, "epoch": 1.8199419167473379, "percentage": 91.0, "elapsed_time": "1:50:29", "remaining_time": "0:10:55"}
209
+ {"current_steps": 1890, "total_steps": 2066, "loss": 1.0053, "lr": 9.562525354954194e-08, "epoch": 1.829622458857696, "percentage": 91.48, "elapsed_time": "1:51:11", "remaining_time": "0:10:21"}
210
+ {"current_steps": 1900, "total_steps": 2066, "loss": 0.9824, "lr": 8.518543427732951e-08, "epoch": 1.8393030009680542, "percentage": 91.97, "elapsed_time": "1:51:47", "remaining_time": "0:09:46"}
211
+ {"current_steps": 1910, "total_steps": 2066, "loss": 0.9876, "lr": 7.53390582675978e-08, "epoch": 1.8489835430784125, "percentage": 92.45, "elapsed_time": "1:53:24", "remaining_time": "0:09:15"}
212
+ {"current_steps": 1920, "total_steps": 2066, "loss": 0.9587, "lr": 6.608854527778319e-08, "epoch": 1.8586640851887706, "percentage": 92.93, "elapsed_time": "1:54:05", "remaining_time": "0:08:40"}
213
+ {"current_steps": 1930, "total_steps": 2066, "loss": 0.9771, "lr": 5.743616863134793e-08, "epoch": 1.8683446272991286, "percentage": 93.42, "elapsed_time": "1:54:39", "remaining_time": "0:08:04"}
214
+ {"current_steps": 1940, "total_steps": 2066, "loss": 0.9591, "lr": 4.938405465910706e-08, "epoch": 1.878025169409487, "percentage": 93.9, "elapsed_time": "1:55:09", "remaining_time": "0:07:28"}
215
+ {"current_steps": 1950, "total_steps": 2066, "loss": 0.9684, "lr": 4.193418217668305e-08, "epoch": 1.8877057115198452, "percentage": 94.39, "elapsed_time": "1:55:42", "remaining_time": "0:06:52"}
216
+ {"current_steps": 1960, "total_steps": 2066, "loss": 0.9637, "lr": 3.508838199820591e-08, "epoch": 1.8973862536302033, "percentage": 94.87, "elapsed_time": "1:56:14", "remaining_time": "0:06:17"}
217
+ {"current_steps": 1970, "total_steps": 2066, "loss": 1.0192, "lr": 2.884833648639257e-08, "epoch": 1.9070667957405614, "percentage": 95.35, "elapsed_time": "1:56:48", "remaining_time": "0:05:41"}
218
+ {"current_steps": 1980, "total_steps": 2066, "loss": 0.9587, "lr": 2.3215579139101996e-08, "epoch": 1.9167473378509197, "percentage": 95.84, "elapsed_time": "1:57:20", "remaining_time": "0:05:05"}
219
+ {"current_steps": 1990, "total_steps": 2066, "loss": 0.9727, "lr": 1.8191494212477513e-08, "epoch": 1.926427879961278, "percentage": 96.32, "elapsed_time": "1:57:53", "remaining_time": "0:04:30"}
220
+ {"current_steps": 2000, "total_steps": 2066, "loss": 0.9891, "lr": 1.3777316380763073e-08, "epoch": 1.936108422071636, "percentage": 96.81, "elapsed_time": "1:58:27", "remaining_time": "0:03:54"}
221
+ {"current_steps": 2000, "total_steps": 2066, "eval_loss": 1.328829288482666, "epoch": 1.936108422071636, "percentage": 96.81, "elapsed_time": "1:58:38", "remaining_time": "0:03:54"}
222
+ {"current_steps": 2010, "total_steps": 2066, "loss": 0.9352, "lr": 9.9741304328832e-09, "epoch": 1.945788964181994, "percentage": 97.29, "elapsed_time": "2:00:15", "remaining_time": "0:03:21"}
223
+ {"current_steps": 2020, "total_steps": 2066, "loss": 0.9591, "lr": 6.782871005851788e-09, "epoch": 1.9554695062923524, "percentage": 97.77, "elapsed_time": "2:00:49", "remaining_time": "0:02:45"}
224
+ {"current_steps": 2030, "total_steps": 2066, "loss": 0.9201, "lr": 4.2043223550869425e-09, "epoch": 1.9651500484027107, "percentage": 98.26, "elapsed_time": "2:01:20", "remaining_time": "0:02:09"}
225
+ {"current_steps": 2040, "total_steps": 2066, "loss": 0.9517, "lr": 2.239118161677656e-09, "epoch": 1.9748305905130688, "percentage": 98.74, "elapsed_time": "2:01:53", "remaining_time": "0:01:33"}
226
+ {"current_steps": 2050, "total_steps": 2066, "loss": 0.9299, "lr": 8.877413766561482e-10, "epoch": 1.9845111326234268, "percentage": 99.23, "elapsed_time": "2:02:31", "remaining_time": "0:00:57"}
227
+ {"current_steps": 2060, "total_steps": 2066, "loss": 0.9732, "lr": 1.5052410231336522e-10, "epoch": 1.9941916747337851, "percentage": 99.71, "elapsed_time": "2:03:11", "remaining_time": "0:00:21"}
228
+ {"current_steps": 2066, "total_steps": 2066, "epoch": 2.0, "percentage": 100.0, "elapsed_time": "2:04:34", "remaining_time": "0:00:00"}
trainer_state.json ADDED
@@ -0,0 +1,1565 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 2.0,
6
+ "eval_steps": 200,
7
+ "global_step": 2066,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.00968054211035818,
14
+ "grad_norm": 4.820261516808518,
15
+ "learning_rate": 7.258064516129033e-07,
16
+ "loss": 1.7236,
17
+ "step": 10
18
+ },
19
+ {
20
+ "epoch": 0.01936108422071636,
21
+ "grad_norm": 3.626062262613067,
22
+ "learning_rate": 1.5322580645161292e-06,
23
+ "loss": 1.627,
24
+ "step": 20
25
+ },
26
+ {
27
+ "epoch": 0.02904162633107454,
28
+ "grad_norm": 2.5817816313172064,
29
+ "learning_rate": 2.338709677419355e-06,
30
+ "loss": 1.4318,
31
+ "step": 30
32
+ },
33
+ {
34
+ "epoch": 0.03872216844143272,
35
+ "grad_norm": 2.120575175645984,
36
+ "learning_rate": 3.145161290322581e-06,
37
+ "loss": 1.3885,
38
+ "step": 40
39
+ },
40
+ {
41
+ "epoch": 0.0484027105517909,
42
+ "grad_norm": 2.3529433456721085,
43
+ "learning_rate": 3.951612903225807e-06,
44
+ "loss": 1.3955,
45
+ "step": 50
46
+ },
47
+ {
48
+ "epoch": 0.05808325266214908,
49
+ "grad_norm": 2.0836995501157407,
50
+ "learning_rate": 4.758064516129033e-06,
51
+ "loss": 1.2718,
52
+ "step": 60
53
+ },
54
+ {
55
+ "epoch": 0.06776379477250725,
56
+ "grad_norm": 2.2076440974932288,
57
+ "learning_rate": 4.999849475897687e-06,
58
+ "loss": 1.3654,
59
+ "step": 70
60
+ },
61
+ {
62
+ "epoch": 0.07744433688286544,
63
+ "grad_norm": 2.0109935662674743,
64
+ "learning_rate": 4.999112258623345e-06,
65
+ "loss": 1.2831,
66
+ "step": 80
67
+ },
68
+ {
69
+ "epoch": 0.08712487899322362,
70
+ "grad_norm": 2.137200234785965,
71
+ "learning_rate": 4.997760881838323e-06,
72
+ "loss": 1.3002,
73
+ "step": 90
74
+ },
75
+ {
76
+ "epoch": 0.0968054211035818,
77
+ "grad_norm": 2.0451930107997325,
78
+ "learning_rate": 4.995795677644913e-06,
79
+ "loss": 1.287,
80
+ "step": 100
81
+ },
82
+ {
83
+ "epoch": 0.10648596321393998,
84
+ "grad_norm": 2.018997434809762,
85
+ "learning_rate": 4.993217128994149e-06,
86
+ "loss": 1.2492,
87
+ "step": 110
88
+ },
89
+ {
90
+ "epoch": 0.11616650532429816,
91
+ "grad_norm": 2.08777196120052,
92
+ "learning_rate": 4.9900258695671176e-06,
93
+ "loss": 1.2794,
94
+ "step": 120
95
+ },
96
+ {
97
+ "epoch": 0.12584704743465633,
98
+ "grad_norm": 2.224636401575338,
99
+ "learning_rate": 4.986222683619237e-06,
100
+ "loss": 1.2506,
101
+ "step": 130
102
+ },
103
+ {
104
+ "epoch": 0.1355275895450145,
105
+ "grad_norm": 1.7965192828814696,
106
+ "learning_rate": 4.981808505787523e-06,
107
+ "loss": 1.2609,
108
+ "step": 140
109
+ },
110
+ {
111
+ "epoch": 0.1452081316553727,
112
+ "grad_norm": 1.8146840191247664,
113
+ "learning_rate": 4.976784420860898e-06,
114
+ "loss": 1.2329,
115
+ "step": 150
116
+ },
117
+ {
118
+ "epoch": 0.15488867376573087,
119
+ "grad_norm": 2.1056167196365423,
120
+ "learning_rate": 4.971151663513608e-06,
121
+ "loss": 1.3551,
122
+ "step": 160
123
+ },
124
+ {
125
+ "epoch": 0.16456921587608905,
126
+ "grad_norm": 1.969170099552798,
127
+ "learning_rate": 4.964911618001794e-06,
128
+ "loss": 1.261,
129
+ "step": 170
130
+ },
131
+ {
132
+ "epoch": 0.17424975798644723,
133
+ "grad_norm": 1.7282575615370996,
134
+ "learning_rate": 4.958065817823318e-06,
135
+ "loss": 1.2055,
136
+ "step": 180
137
+ },
138
+ {
139
+ "epoch": 0.18393030009680542,
140
+ "grad_norm": 2.263093065284853,
141
+ "learning_rate": 4.950615945340893e-06,
142
+ "loss": 1.3022,
143
+ "step": 190
144
+ },
145
+ {
146
+ "epoch": 0.1936108422071636,
147
+ "grad_norm": 1.947338641667041,
148
+ "learning_rate": 4.942563831368653e-06,
149
+ "loss": 1.2701,
150
+ "step": 200
151
+ },
152
+ {
153
+ "epoch": 0.1936108422071636,
154
+ "eval_loss": 1.3192224502563477,
155
+ "eval_runtime": 11.1204,
156
+ "eval_samples_per_second": 60.07,
157
+ "eval_steps_per_second": 3.777,
158
+ "step": 200
159
+ },
160
+ {
161
+ "epoch": 0.20329138431752178,
162
+ "grad_norm": 2.3763373561887176,
163
+ "learning_rate": 4.933911454722217e-06,
164
+ "loss": 1.277,
165
+ "step": 210
166
+ },
167
+ {
168
+ "epoch": 0.21297192642787996,
169
+ "grad_norm": 1.8476884141723375,
170
+ "learning_rate": 4.924660941732403e-06,
171
+ "loss": 1.2418,
172
+ "step": 220
173
+ },
174
+ {
175
+ "epoch": 0.22265246853823814,
176
+ "grad_norm": 2.088709216532341,
177
+ "learning_rate": 4.914814565722671e-06,
178
+ "loss": 1.294,
179
+ "step": 230
180
+ },
181
+ {
182
+ "epoch": 0.23233301064859632,
183
+ "grad_norm": 1.9726270845953118,
184
+ "learning_rate": 4.9043747464504586e-06,
185
+ "loss": 1.2823,
186
+ "step": 240
187
+ },
188
+ {
189
+ "epoch": 0.2420135527589545,
190
+ "grad_norm": 2.153285232108024,
191
+ "learning_rate": 4.893344049512519e-06,
192
+ "loss": 1.2753,
193
+ "step": 250
194
+ },
195
+ {
196
+ "epoch": 0.25169409486931266,
197
+ "grad_norm": 1.698193104908672,
198
+ "learning_rate": 4.881725185714421e-06,
199
+ "loss": 1.1851,
200
+ "step": 260
201
+ },
202
+ {
203
+ "epoch": 0.26137463697967084,
204
+ "grad_norm": 2.2926805502010166,
205
+ "learning_rate": 4.869521010404373e-06,
206
+ "loss": 1.2901,
207
+ "step": 270
208
+ },
209
+ {
210
+ "epoch": 0.271055179090029,
211
+ "grad_norm": 1.93897800592033,
212
+ "learning_rate": 4.856734522771512e-06,
213
+ "loss": 1.246,
214
+ "step": 280
215
+ },
216
+ {
217
+ "epoch": 0.2807357212003872,
218
+ "grad_norm": 1.9602823265085942,
219
+ "learning_rate": 4.843368865108847e-06,
220
+ "loss": 1.204,
221
+ "step": 290
222
+ },
223
+ {
224
+ "epoch": 0.2904162633107454,
225
+ "grad_norm": 2.002239339436152,
226
+ "learning_rate": 4.8294273220410494e-06,
227
+ "loss": 1.271,
228
+ "step": 300
229
+ },
230
+ {
231
+ "epoch": 0.30009680542110356,
232
+ "grad_norm": 1.9017334477170922,
233
+ "learning_rate": 4.814913319717238e-06,
234
+ "loss": 1.307,
235
+ "step": 310
236
+ },
237
+ {
238
+ "epoch": 0.30977734753146174,
239
+ "grad_norm": 2.042693902021796,
240
+ "learning_rate": 4.799830424969008e-06,
241
+ "loss": 1.273,
242
+ "step": 320
243
+ },
244
+ {
245
+ "epoch": 0.3194578896418199,
246
+ "grad_norm": 1.948504976997528,
247
+ "learning_rate": 4.784182344433878e-06,
248
+ "loss": 1.2719,
249
+ "step": 330
250
+ },
251
+ {
252
+ "epoch": 0.3291384317521781,
253
+ "grad_norm": 2.083841470170843,
254
+ "learning_rate": 4.767972923644377e-06,
255
+ "loss": 1.2732,
256
+ "step": 340
257
+ },
258
+ {
259
+ "epoch": 0.3388189738625363,
260
+ "grad_norm": 2.1919169498570055,
261
+ "learning_rate": 4.751206146083002e-06,
262
+ "loss": 1.3289,
263
+ "step": 350
264
+ },
265
+ {
266
+ "epoch": 0.34849951597289447,
267
+ "grad_norm": 1.9533269346119766,
268
+ "learning_rate": 4.7338861322032724e-06,
269
+ "loss": 1.2303,
270
+ "step": 360
271
+ },
272
+ {
273
+ "epoch": 0.35818005808325265,
274
+ "grad_norm": 1.9672032803697441,
275
+ "learning_rate": 4.716017138417126e-06,
276
+ "loss": 1.1788,
277
+ "step": 370
278
+ },
279
+ {
280
+ "epoch": 0.36786060019361083,
281
+ "grad_norm": 1.7493745751363814,
282
+ "learning_rate": 4.697603556048899e-06,
283
+ "loss": 1.2543,
284
+ "step": 380
285
+ },
286
+ {
287
+ "epoch": 0.377541142303969,
288
+ "grad_norm": 1.8808999956966037,
289
+ "learning_rate": 4.6786499102561525e-06,
290
+ "loss": 1.3091,
291
+ "step": 390
292
+ },
293
+ {
294
+ "epoch": 0.3872216844143272,
295
+ "grad_norm": 1.8576232558705237,
296
+ "learning_rate": 4.659160858917614e-06,
297
+ "loss": 1.2693,
298
+ "step": 400
299
+ },
300
+ {
301
+ "epoch": 0.3872216844143272,
302
+ "eval_loss": 1.3101810216903687,
303
+ "eval_runtime": 11.0364,
304
+ "eval_samples_per_second": 60.527,
305
+ "eval_steps_per_second": 3.806,
306
+ "step": 400
307
+ },
308
+ {
309
+ "epoch": 0.3969022265246854,
310
+ "grad_norm": 2.169929911455977,
311
+ "learning_rate": 4.639141191488498e-06,
312
+ "loss": 1.2866,
313
+ "step": 410
314
+ },
315
+ {
316
+ "epoch": 0.40658276863504356,
317
+ "grad_norm": 1.903451878169039,
318
+ "learning_rate": 4.618595827823486e-06,
319
+ "loss": 1.3088,
320
+ "step": 420
321
+ },
322
+ {
323
+ "epoch": 0.41626331074540174,
324
+ "grad_norm": 1.8380999378945895,
325
+ "learning_rate": 4.597529816967676e-06,
326
+ "loss": 1.2445,
327
+ "step": 430
328
+ },
329
+ {
330
+ "epoch": 0.4259438528557599,
331
+ "grad_norm": 1.794994240260647,
332
+ "learning_rate": 4.575948335915769e-06,
333
+ "loss": 1.2679,
334
+ "step": 440
335
+ },
336
+ {
337
+ "epoch": 0.4356243949661181,
338
+ "grad_norm": 1.8131554887283838,
339
+ "learning_rate": 4.553856688339817e-06,
340
+ "loss": 1.2699,
341
+ "step": 450
342
+ },
343
+ {
344
+ "epoch": 0.4453049370764763,
345
+ "grad_norm": 1.822749339089404,
346
+ "learning_rate": 4.531260303285841e-06,
347
+ "loss": 1.2381,
348
+ "step": 460
349
+ },
350
+ {
351
+ "epoch": 0.45498547918683446,
352
+ "grad_norm": 1.7488634444209434,
353
+ "learning_rate": 4.50816473383964e-06,
354
+ "loss": 1.3089,
355
+ "step": 470
356
+ },
357
+ {
358
+ "epoch": 0.46466602129719264,
359
+ "grad_norm": 1.8572372415183604,
360
+ "learning_rate": 4.484575655762107e-06,
361
+ "loss": 1.2271,
362
+ "step": 480
363
+ },
364
+ {
365
+ "epoch": 0.4743465634075508,
366
+ "grad_norm": 2.072794841066779,
367
+ "learning_rate": 4.460498866094412e-06,
368
+ "loss": 1.2136,
369
+ "step": 490
370
+ },
371
+ {
372
+ "epoch": 0.484027105517909,
373
+ "grad_norm": 1.7764723909580904,
374
+ "learning_rate": 4.435940281733369e-06,
375
+ "loss": 1.2747,
376
+ "step": 500
377
+ },
378
+ {
379
+ "epoch": 0.4937076476282672,
380
+ "grad_norm": 2.0917776639027914,
381
+ "learning_rate": 4.410905937977353e-06,
382
+ "loss": 1.265,
383
+ "step": 510
384
+ },
385
+ {
386
+ "epoch": 0.5033881897386253,
387
+ "grad_norm": 2.0260759502040786,
388
+ "learning_rate": 4.385401987043118e-06,
389
+ "loss": 1.2895,
390
+ "step": 520
391
+ },
392
+ {
393
+ "epoch": 0.5130687318489835,
394
+ "grad_norm": 1.9734465330232784,
395
+ "learning_rate": 4.359434696553889e-06,
396
+ "loss": 1.2376,
397
+ "step": 530
398
+ },
399
+ {
400
+ "epoch": 0.5227492739593417,
401
+ "grad_norm": 1.7584757436973566,
402
+ "learning_rate": 4.333010447999077e-06,
403
+ "loss": 1.2575,
404
+ "step": 540
405
+ },
406
+ {
407
+ "epoch": 0.5324298160696999,
408
+ "grad_norm": 1.9187648320497177,
409
+ "learning_rate": 4.3061357351660285e-06,
410
+ "loss": 1.267,
411
+ "step": 550
412
+ },
413
+ {
414
+ "epoch": 0.542110358180058,
415
+ "grad_norm": 1.7622412801152667,
416
+ "learning_rate": 4.27881716254417e-06,
417
+ "loss": 1.2584,
418
+ "step": 560
419
+ },
420
+ {
421
+ "epoch": 0.5517909002904162,
422
+ "grad_norm": 1.9661817706505653,
423
+ "learning_rate": 4.251061443701941e-06,
424
+ "loss": 1.2263,
425
+ "step": 570
426
+ },
427
+ {
428
+ "epoch": 0.5614714424007744,
429
+ "grad_norm": 1.8276707656109445,
430
+ "learning_rate": 4.222875399636938e-06,
431
+ "loss": 1.2231,
432
+ "step": 580
433
+ },
434
+ {
435
+ "epoch": 0.5711519845111326,
436
+ "grad_norm": 2.04251694462628,
437
+ "learning_rate": 4.194265957099638e-06,
438
+ "loss": 1.2656,
439
+ "step": 590
440
+ },
441
+ {
442
+ "epoch": 0.5808325266214908,
443
+ "grad_norm": 1.6830527245888396,
444
+ "learning_rate": 4.165240146891145e-06,
445
+ "loss": 1.2341,
446
+ "step": 600
447
+ },
448
+ {
449
+ "epoch": 0.5808325266214908,
450
+ "eval_loss": 1.3036646842956543,
451
+ "eval_runtime": 11.0157,
452
+ "eval_samples_per_second": 60.641,
453
+ "eval_steps_per_second": 3.813,
454
+ "step": 600
455
+ },
456
+ {
457
+ "epoch": 0.590513068731849,
458
+ "grad_norm": 1.8700224183798644,
459
+ "learning_rate": 4.1358051021353655e-06,
460
+ "loss": 1.2413,
461
+ "step": 610
462
+ },
463
+ {
464
+ "epoch": 0.6001936108422071,
465
+ "grad_norm": 2.025528365164639,
466
+ "learning_rate": 4.1059680565260315e-06,
467
+ "loss": 1.2342,
468
+ "step": 620
469
+ },
470
+ {
471
+ "epoch": 0.6098741529525653,
472
+ "grad_norm": 1.7543494271951428,
473
+ "learning_rate": 4.0757363425490185e-06,
474
+ "loss": 1.1899,
475
+ "step": 630
476
+ },
477
+ {
478
+ "epoch": 0.6195546950629235,
479
+ "grad_norm": 1.8857378829308964,
480
+ "learning_rate": 4.04511738968037e-06,
481
+ "loss": 1.1912,
482
+ "step": 640
483
+ },
484
+ {
485
+ "epoch": 0.6292352371732817,
486
+ "grad_norm": 1.8763965346178748,
487
+ "learning_rate": 4.0141187225605064e-06,
488
+ "loss": 1.2066,
489
+ "step": 650
490
+ },
491
+ {
492
+ "epoch": 0.6389157792836399,
493
+ "grad_norm": 1.7240907552108375,
494
+ "learning_rate": 3.98274795914503e-06,
495
+ "loss": 1.2394,
496
+ "step": 660
497
+ },
498
+ {
499
+ "epoch": 0.648596321393998,
500
+ "grad_norm": 1.7995970571853088,
501
+ "learning_rate": 3.951012808832603e-06,
502
+ "loss": 1.2069,
503
+ "step": 670
504
+ },
505
+ {
506
+ "epoch": 0.6582768635043562,
507
+ "grad_norm": 2.0416216909511355,
508
+ "learning_rate": 3.918921070570361e-06,
509
+ "loss": 1.2724,
510
+ "step": 680
511
+ },
512
+ {
513
+ "epoch": 0.6679574056147144,
514
+ "grad_norm": 1.8831005612989327,
515
+ "learning_rate": 3.886480630937307e-06,
516
+ "loss": 1.3105,
517
+ "step": 690
518
+ },
519
+ {
520
+ "epoch": 0.6776379477250726,
521
+ "grad_norm": 1.8563831558700061,
522
+ "learning_rate": 3.853699462206183e-06,
523
+ "loss": 1.1989,
524
+ "step": 700
525
+ },
526
+ {
527
+ "epoch": 0.6873184898354308,
528
+ "grad_norm": 1.642368767478371,
529
+ "learning_rate": 3.820585620384265e-06,
530
+ "loss": 1.3256,
531
+ "step": 710
532
+ },
533
+ {
534
+ "epoch": 0.6969990319457889,
535
+ "grad_norm": 1.8190881361027762,
536
+ "learning_rate": 3.787147243233602e-06,
537
+ "loss": 1.2206,
538
+ "step": 720
539
+ },
540
+ {
541
+ "epoch": 0.7066795740561471,
542
+ "grad_norm": 1.9867148410918392,
543
+ "learning_rate": 3.753392548271144e-06,
544
+ "loss": 1.2245,
545
+ "step": 730
546
+ },
547
+ {
548
+ "epoch": 0.7163601161665053,
549
+ "grad_norm": 1.6456620597756189,
550
+ "learning_rate": 3.7193298307492855e-06,
551
+ "loss": 1.2685,
552
+ "step": 740
553
+ },
554
+ {
555
+ "epoch": 0.7260406582768635,
556
+ "grad_norm": 1.7765857550257254,
557
+ "learning_rate": 3.6849674616172887e-06,
558
+ "loss": 1.2379,
559
+ "step": 750
560
+ },
561
+ {
562
+ "epoch": 0.7357212003872217,
563
+ "grad_norm": 1.7474726903371232,
564
+ "learning_rate": 3.6503138854641257e-06,
565
+ "loss": 1.2176,
566
+ "step": 760
567
+ },
568
+ {
569
+ "epoch": 0.7454017424975798,
570
+ "grad_norm": 1.6559737928060483,
571
+ "learning_rate": 3.615377618443201e-06,
572
+ "loss": 1.2751,
573
+ "step": 770
574
+ },
575
+ {
576
+ "epoch": 0.755082284607938,
577
+ "grad_norm": 1.6608737876417763,
578
+ "learning_rate": 3.5801672461795032e-06,
579
+ "loss": 1.2335,
580
+ "step": 780
581
+ },
582
+ {
583
+ "epoch": 0.7647628267182962,
584
+ "grad_norm": 1.777841876809436,
585
+ "learning_rate": 3.5446914216596805e-06,
586
+ "loss": 1.2816,
587
+ "step": 790
588
+ },
589
+ {
590
+ "epoch": 0.7744433688286544,
591
+ "grad_norm": 1.7176497633610424,
592
+ "learning_rate": 3.5089588631055527e-06,
593
+ "loss": 1.1997,
594
+ "step": 800
595
+ },
596
+ {
597
+ "epoch": 0.7744433688286544,
598
+ "eval_loss": 1.2973461151123047,
599
+ "eval_runtime": 11.1321,
600
+ "eval_samples_per_second": 60.006,
601
+ "eval_steps_per_second": 3.773,
602
+ "step": 800
603
+ },
604
+ {
605
+ "epoch": 0.7841239109390126,
606
+ "grad_norm": 1.68764869861233,
607
+ "learning_rate": 3.472978351831606e-06,
608
+ "loss": 1.2153,
609
+ "step": 810
610
+ },
611
+ {
612
+ "epoch": 0.7938044530493708,
613
+ "grad_norm": 1.8236059508367688,
614
+ "learning_rate": 3.436758730086971e-06,
615
+ "loss": 1.1981,
616
+ "step": 820
617
+ },
618
+ {
619
+ "epoch": 0.8034849951597289,
620
+ "grad_norm": 2.095607905740359,
621
+ "learning_rate": 3.4003088988824323e-06,
622
+ "loss": 1.2271,
623
+ "step": 830
624
+ },
625
+ {
626
+ "epoch": 0.8131655372700871,
627
+ "grad_norm": 2.2791939009599282,
628
+ "learning_rate": 3.363637815802998e-06,
629
+ "loss": 1.2394,
630
+ "step": 840
631
+ },
632
+ {
633
+ "epoch": 0.8228460793804453,
634
+ "grad_norm": 1.6243179467285545,
635
+ "learning_rate": 3.326754492806559e-06,
636
+ "loss": 1.2334,
637
+ "step": 850
638
+ },
639
+ {
640
+ "epoch": 0.8325266214908035,
641
+ "grad_norm": 2.031052243391612,
642
+ "learning_rate": 3.2896679940091913e-06,
643
+ "loss": 1.2327,
644
+ "step": 860
645
+ },
646
+ {
647
+ "epoch": 0.8422071636011617,
648
+ "grad_norm": 1.8731727882620728,
649
+ "learning_rate": 3.2523874334576456e-06,
650
+ "loss": 1.2282,
651
+ "step": 870
652
+ },
653
+ {
654
+ "epoch": 0.8518877057115198,
655
+ "grad_norm": 1.673399951977698,
656
+ "learning_rate": 3.214921972889552e-06,
657
+ "loss": 1.2132,
658
+ "step": 880
659
+ },
660
+ {
661
+ "epoch": 0.861568247821878,
662
+ "grad_norm": 1.9311521456994778,
663
+ "learning_rate": 3.17728081948192e-06,
664
+ "loss": 1.2473,
665
+ "step": 890
666
+ },
667
+ {
668
+ "epoch": 0.8712487899322362,
669
+ "grad_norm": 1.6883274712781868,
670
+ "learning_rate": 3.139473223588462e-06,
671
+ "loss": 1.2524,
672
+ "step": 900
673
+ },
674
+ {
675
+ "epoch": 0.8809293320425944,
676
+ "grad_norm": 1.7885335433801643,
677
+ "learning_rate": 3.1015084764663074e-06,
678
+ "loss": 1.2423,
679
+ "step": 910
680
+ },
681
+ {
682
+ "epoch": 0.8906098741529526,
683
+ "grad_norm": 1.588598381552853,
684
+ "learning_rate": 3.063395907992671e-06,
685
+ "loss": 1.1997,
686
+ "step": 920
687
+ },
688
+ {
689
+ "epoch": 0.9002904162633107,
690
+ "grad_norm": 1.7465829948784894,
691
+ "learning_rate": 3.025144884372021e-06,
692
+ "loss": 1.2233,
693
+ "step": 930
694
+ },
695
+ {
696
+ "epoch": 0.9099709583736689,
697
+ "grad_norm": 1.7479955519439432,
698
+ "learning_rate": 2.9867648058343262e-06,
699
+ "loss": 1.2115,
700
+ "step": 940
701
+ },
702
+ {
703
+ "epoch": 0.9196515004840271,
704
+ "grad_norm": 1.838586531536656,
705
+ "learning_rate": 2.948265104324941e-06,
706
+ "loss": 1.2139,
707
+ "step": 950
708
+ },
709
+ {
710
+ "epoch": 0.9293320425943853,
711
+ "grad_norm": 1.8881700515006026,
712
+ "learning_rate": 2.9096552411866903e-06,
713
+ "loss": 1.2201,
714
+ "step": 960
715
+ },
716
+ {
717
+ "epoch": 0.9390125847047435,
718
+ "grad_norm": 1.9341900631906526,
719
+ "learning_rate": 2.8709447048347394e-06,
720
+ "loss": 1.1997,
721
+ "step": 970
722
+ },
723
+ {
724
+ "epoch": 0.9486931268151017,
725
+ "grad_norm": 1.689669805952995,
726
+ "learning_rate": 2.832143008424802e-06,
727
+ "loss": 1.2363,
728
+ "step": 980
729
+ },
730
+ {
731
+ "epoch": 0.9583736689254598,
732
+ "grad_norm": 1.7360587308811468,
733
+ "learning_rate": 2.7932596875152747e-06,
734
+ "loss": 1.2573,
735
+ "step": 990
736
+ },
737
+ {
738
+ "epoch": 0.968054211035818,
739
+ "grad_norm": 1.6379992314547125,
740
+ "learning_rate": 2.754304297723862e-06,
741
+ "loss": 1.2403,
742
+ "step": 1000
743
+ },
744
+ {
745
+ "epoch": 0.968054211035818,
746
+ "eval_loss": 1.2926961183547974,
747
+ "eval_runtime": 12.1297,
748
+ "eval_samples_per_second": 55.072,
749
+ "eval_steps_per_second": 3.463,
750
+ "step": 1000
751
+ },
752
+ {
753
+ "epoch": 0.9777347531461762,
754
+ "grad_norm": 1.8316472126902184,
755
+ "learning_rate": 2.7152864123792716e-06,
756
+ "loss": 1.2915,
757
+ "step": 1010
758
+ },
759
+ {
760
+ "epoch": 0.9874152952565344,
761
+ "grad_norm": 2.0299073379393096,
762
+ "learning_rate": 2.6762156201685627e-06,
763
+ "loss": 1.2246,
764
+ "step": 1020
765
+ },
766
+ {
767
+ "epoch": 0.9970958373668926,
768
+ "grad_norm": 1.6668629036408331,
769
+ "learning_rate": 2.6371015227807127e-06,
770
+ "loss": 1.302,
771
+ "step": 1030
772
+ },
773
+ {
774
+ "epoch": 1.0067763794772506,
775
+ "grad_norm": 1.8978062087217702,
776
+ "learning_rate": 2.5979537325469913e-06,
777
+ "loss": 1.1438,
778
+ "step": 1040
779
+ },
780
+ {
781
+ "epoch": 1.016456921587609,
782
+ "grad_norm": 1.994568701365586,
783
+ "learning_rate": 2.558781870078722e-06,
784
+ "loss": 0.9893,
785
+ "step": 1050
786
+ },
787
+ {
788
+ "epoch": 1.026137463697967,
789
+ "grad_norm": 1.9256869273008883,
790
+ "learning_rate": 2.5195955619030064e-06,
791
+ "loss": 0.9725,
792
+ "step": 1060
793
+ },
794
+ {
795
+ "epoch": 1.0358180058083253,
796
+ "grad_norm": 2.107489644464238,
797
+ "learning_rate": 2.480404438096994e-06,
798
+ "loss": 0.9776,
799
+ "step": 1070
800
+ },
801
+ {
802
+ "epoch": 1.0454985479186834,
803
+ "grad_norm": 1.749601621960194,
804
+ "learning_rate": 2.441218129921278e-06,
805
+ "loss": 1.0161,
806
+ "step": 1080
807
+ },
808
+ {
809
+ "epoch": 1.0551790900290416,
810
+ "grad_norm": 1.9337057731043839,
811
+ "learning_rate": 2.402046267453009e-06,
812
+ "loss": 1.0164,
813
+ "step": 1090
814
+ },
815
+ {
816
+ "epoch": 1.0648596321393997,
817
+ "grad_norm": 2.1084811781965307,
818
+ "learning_rate": 2.3628984772192885e-06,
819
+ "loss": 0.9799,
820
+ "step": 1100
821
+ },
822
+ {
823
+ "epoch": 1.074540174249758,
824
+ "grad_norm": 2.3394701146918218,
825
+ "learning_rate": 2.323784379831438e-06,
826
+ "loss": 0.9829,
827
+ "step": 1110
828
+ },
829
+ {
830
+ "epoch": 1.084220716360116,
831
+ "grad_norm": 2.192823910061485,
832
+ "learning_rate": 2.2847135876207292e-06,
833
+ "loss": 0.9397,
834
+ "step": 1120
835
+ },
836
+ {
837
+ "epoch": 1.0939012584704744,
838
+ "grad_norm": 1.8908009391071738,
839
+ "learning_rate": 2.245695702276139e-06,
840
+ "loss": 0.9544,
841
+ "step": 1130
842
+ },
843
+ {
844
+ "epoch": 1.1035818005808324,
845
+ "grad_norm": 2.056158442927219,
846
+ "learning_rate": 2.2067403124847257e-06,
847
+ "loss": 0.9867,
848
+ "step": 1140
849
+ },
850
+ {
851
+ "epoch": 1.1132623426911907,
852
+ "grad_norm": 1.8426000838714471,
853
+ "learning_rate": 2.167856991575199e-06,
854
+ "loss": 0.9843,
855
+ "step": 1150
856
+ },
857
+ {
858
+ "epoch": 1.1229428848015488,
859
+ "grad_norm": 1.8272505713437244,
860
+ "learning_rate": 2.1290552951652614e-06,
861
+ "loss": 0.9621,
862
+ "step": 1160
863
+ },
864
+ {
865
+ "epoch": 1.132623426911907,
866
+ "grad_norm": 1.8843537624286948,
867
+ "learning_rate": 2.09034475881331e-06,
868
+ "loss": 1.0003,
869
+ "step": 1170
870
+ },
871
+ {
872
+ "epoch": 1.1423039690222652,
873
+ "grad_norm": 1.7985029833743278,
874
+ "learning_rate": 2.0517348956750597e-06,
875
+ "loss": 0.9598,
876
+ "step": 1180
877
+ },
878
+ {
879
+ "epoch": 1.1519845111326235,
880
+ "grad_norm": 1.9352819350548416,
881
+ "learning_rate": 2.0132351941656737e-06,
882
+ "loss": 0.9328,
883
+ "step": 1190
884
+ },
885
+ {
886
+ "epoch": 1.1616650532429815,
887
+ "grad_norm": 2.053591523251749,
888
+ "learning_rate": 1.9748551156279803e-06,
889
+ "loss": 0.9994,
890
+ "step": 1200
891
+ },
892
+ {
893
+ "epoch": 1.1616650532429815,
894
+ "eval_loss": 1.3311784267425537,
895
+ "eval_runtime": 11.038,
896
+ "eval_samples_per_second": 60.518,
897
+ "eval_steps_per_second": 3.805,
898
+ "step": 1200
899
+ },
900
+ {
901
+ "epoch": 1.1713455953533398,
902
+ "grad_norm": 1.8508139660473053,
903
+ "learning_rate": 1.93660409200733e-06,
904
+ "loss": 1.0021,
905
+ "step": 1210
906
+ },
907
+ {
908
+ "epoch": 1.181026137463698,
909
+ "grad_norm": 1.7834354892337438,
910
+ "learning_rate": 1.8984915235336934e-06,
911
+ "loss": 1.0096,
912
+ "step": 1220
913
+ },
914
+ {
915
+ "epoch": 1.1907066795740562,
916
+ "grad_norm": 1.9980460190401137,
917
+ "learning_rate": 1.860526776411539e-06,
918
+ "loss": 1.0096,
919
+ "step": 1230
920
+ },
921
+ {
922
+ "epoch": 1.2003872216844143,
923
+ "grad_norm": 1.9068932720510319,
924
+ "learning_rate": 1.8227191805180806e-06,
925
+ "loss": 1.0768,
926
+ "step": 1240
927
+ },
928
+ {
929
+ "epoch": 1.2100677637947725,
930
+ "grad_norm": 1.912490766033442,
931
+ "learning_rate": 1.7850780271104483e-06,
932
+ "loss": 1.0072,
933
+ "step": 1250
934
+ },
935
+ {
936
+ "epoch": 1.2197483059051306,
937
+ "grad_norm": 2.185779735989334,
938
+ "learning_rate": 1.747612566542356e-06,
939
+ "loss": 0.9954,
940
+ "step": 1260
941
+ },
942
+ {
943
+ "epoch": 1.229428848015489,
944
+ "grad_norm": 2.1799620122111283,
945
+ "learning_rate": 1.7103320059908093e-06,
946
+ "loss": 0.9856,
947
+ "step": 1270
948
+ },
949
+ {
950
+ "epoch": 1.239109390125847,
951
+ "grad_norm": 1.754880227600898,
952
+ "learning_rate": 1.6732455071934424e-06,
953
+ "loss": 0.9882,
954
+ "step": 1280
955
+ },
956
+ {
957
+ "epoch": 1.2487899322362053,
958
+ "grad_norm": 2.099244672028386,
959
+ "learning_rate": 1.6363621841970022e-06,
960
+ "loss": 0.9218,
961
+ "step": 1290
962
+ },
963
+ {
964
+ "epoch": 1.2584704743465633,
965
+ "grad_norm": 2.120833202856271,
966
+ "learning_rate": 1.5996911011175675e-06,
967
+ "loss": 0.9803,
968
+ "step": 1300
969
+ },
970
+ {
971
+ "epoch": 1.2681510164569216,
972
+ "grad_norm": 1.9306623752452063,
973
+ "learning_rate": 1.5632412699130306e-06,
974
+ "loss": 0.9732,
975
+ "step": 1310
976
+ },
977
+ {
978
+ "epoch": 1.2778315585672797,
979
+ "grad_norm": 1.929234322625883,
980
+ "learning_rate": 1.5270216481683954e-06,
981
+ "loss": 0.9656,
982
+ "step": 1320
983
+ },
984
+ {
985
+ "epoch": 1.287512100677638,
986
+ "grad_norm": 1.7522945520010809,
987
+ "learning_rate": 1.4910411368944483e-06,
988
+ "loss": 0.968,
989
+ "step": 1330
990
+ },
991
+ {
992
+ "epoch": 1.297192642787996,
993
+ "grad_norm": 2.0785744202804843,
994
+ "learning_rate": 1.4553085783403201e-06,
995
+ "loss": 1.0192,
996
+ "step": 1340
997
+ },
998
+ {
999
+ "epoch": 1.3068731848983544,
1000
+ "grad_norm": 1.8612922010001036,
1001
+ "learning_rate": 1.419832753820496e-06,
1002
+ "loss": 0.9862,
1003
+ "step": 1350
1004
+ },
1005
+ {
1006
+ "epoch": 1.3165537270087124,
1007
+ "grad_norm": 1.822665782350888,
1008
+ "learning_rate": 1.3846223815568005e-06,
1009
+ "loss": 0.9557,
1010
+ "step": 1360
1011
+ },
1012
+ {
1013
+ "epoch": 1.3262342691190707,
1014
+ "grad_norm": 2.0230674241134565,
1015
+ "learning_rate": 1.349686114535875e-06,
1016
+ "loss": 0.9833,
1017
+ "step": 1370
1018
+ },
1019
+ {
1020
+ "epoch": 1.3359148112294288,
1021
+ "grad_norm": 1.860216169744428,
1022
+ "learning_rate": 1.3150325383827117e-06,
1023
+ "loss": 1.0331,
1024
+ "step": 1380
1025
+ },
1026
+ {
1027
+ "epoch": 1.345595353339787,
1028
+ "grad_norm": 1.7708281013805311,
1029
+ "learning_rate": 1.2806701692507162e-06,
1030
+ "loss": 1.0069,
1031
+ "step": 1390
1032
+ },
1033
+ {
1034
+ "epoch": 1.3552758954501452,
1035
+ "grad_norm": 2.2229157558033057,
1036
+ "learning_rate": 1.2466074517288558e-06,
1037
+ "loss": 0.9531,
1038
+ "step": 1400
1039
+ },
1040
+ {
1041
+ "epoch": 1.3552758954501452,
1042
+ "eval_loss": 1.3317842483520508,
1043
+ "eval_runtime": 11.0567,
1044
+ "eval_samples_per_second": 60.416,
1045
+ "eval_steps_per_second": 3.799,
1046
+ "step": 1400
1047
+ },
1048
+ {
1049
+ "epoch": 1.3649564375605034,
1050
+ "grad_norm": 2.2172077437876148,
1051
+ "learning_rate": 1.212852756766399e-06,
1052
+ "loss": 0.991,
1053
+ "step": 1410
1054
+ },
1055
+ {
1056
+ "epoch": 1.3746369796708615,
1057
+ "grad_norm": 1.9872462249614302,
1058
+ "learning_rate": 1.1794143796157358e-06,
1059
+ "loss": 0.957,
1060
+ "step": 1420
1061
+ },
1062
+ {
1063
+ "epoch": 1.3843175217812198,
1064
+ "grad_norm": 1.9736206439733366,
1065
+ "learning_rate": 1.1463005377938182e-06,
1066
+ "loss": 0.9406,
1067
+ "step": 1430
1068
+ },
1069
+ {
1070
+ "epoch": 1.3939980638915779,
1071
+ "grad_norm": 1.868279202351412,
1072
+ "learning_rate": 1.1135193690626926e-06,
1073
+ "loss": 0.958,
1074
+ "step": 1440
1075
+ },
1076
+ {
1077
+ "epoch": 1.4036786060019362,
1078
+ "grad_norm": 1.773679136583482,
1079
+ "learning_rate": 1.0810789294296397e-06,
1080
+ "loss": 1.0262,
1081
+ "step": 1450
1082
+ },
1083
+ {
1084
+ "epoch": 1.4133591481122942,
1085
+ "grad_norm": 1.9372925542431774,
1086
+ "learning_rate": 1.048987191167398e-06,
1087
+ "loss": 0.9745,
1088
+ "step": 1460
1089
+ },
1090
+ {
1091
+ "epoch": 1.4230396902226525,
1092
+ "grad_norm": 1.9130176009818487,
1093
+ "learning_rate": 1.0172520408549716e-06,
1094
+ "loss": 0.9759,
1095
+ "step": 1470
1096
+ },
1097
+ {
1098
+ "epoch": 1.4327202323330106,
1099
+ "grad_norm": 1.7950360757272799,
1100
+ "learning_rate": 9.858812774394946e-07,
1101
+ "loss": 1.0117,
1102
+ "step": 1480
1103
+ },
1104
+ {
1105
+ "epoch": 1.442400774443369,
1106
+ "grad_norm": 2.049973774645242,
1107
+ "learning_rate": 9.548826103196304e-07,
1108
+ "loss": 0.9736,
1109
+ "step": 1490
1110
+ },
1111
+ {
1112
+ "epoch": 1.452081316553727,
1113
+ "grad_norm": 2.1258539830795975,
1114
+ "learning_rate": 9.242636574509828e-07,
1115
+ "loss": 1.002,
1116
+ "step": 1500
1117
+ },
1118
+ {
1119
+ "epoch": 1.4617618586640853,
1120
+ "grad_norm": 2.005339561713667,
1121
+ "learning_rate": 8.940319434739683e-07,
1122
+ "loss": 1.0391,
1123
+ "step": 1510
1124
+ },
1125
+ {
1126
+ "epoch": 1.4714424007744433,
1127
+ "grad_norm": 2.239017432444768,
1128
+ "learning_rate": 8.641948978646361e-07,
1129
+ "loss": 0.9864,
1130
+ "step": 1520
1131
+ },
1132
+ {
1133
+ "epoch": 1.4811229428848016,
1134
+ "grad_norm": 1.9665985929622667,
1135
+ "learning_rate": 8.347598531088555e-07,
1136
+ "loss": 1.0425,
1137
+ "step": 1530
1138
+ },
1139
+ {
1140
+ "epoch": 1.4908034849951597,
1141
+ "grad_norm": 2.183127365642101,
1142
+ "learning_rate": 8.05734042900363e-07,
1143
+ "loss": 1.0028,
1144
+ "step": 1540
1145
+ },
1146
+ {
1147
+ "epoch": 1.500484027105518,
1148
+ "grad_norm": 1.9211536428186309,
1149
+ "learning_rate": 7.771246003630625e-07,
1150
+ "loss": 0.9764,
1151
+ "step": 1550
1152
+ },
1153
+ {
1154
+ "epoch": 1.510164569215876,
1155
+ "grad_norm": 2.1375711365241647,
1156
+ "learning_rate": 7.489385562980589e-07,
1157
+ "loss": 0.9658,
1158
+ "step": 1560
1159
+ },
1160
+ {
1161
+ "epoch": 1.5198451113262341,
1162
+ "grad_norm": 1.8686163130309794,
1163
+ "learning_rate": 7.211828374558311e-07,
1164
+ "loss": 0.9621,
1165
+ "step": 1570
1166
+ },
1167
+ {
1168
+ "epoch": 1.5295256534365924,
1169
+ "grad_norm": 1.880595967296347,
1170
+ "learning_rate": 6.938642648339719e-07,
1171
+ "loss": 0.9874,
1172
+ "step": 1580
1173
+ },
1174
+ {
1175
+ "epoch": 1.5392061955469507,
1176
+ "grad_norm": 1.9750877274994336,
1177
+ "learning_rate": 6.669895520009239e-07,
1178
+ "loss": 0.9481,
1179
+ "step": 1590
1180
+ },
1181
+ {
1182
+ "epoch": 1.5488867376573088,
1183
+ "grad_norm": 2.074716832274806,
1184
+ "learning_rate": 6.405653034461115e-07,
1185
+ "loss": 0.9555,
1186
+ "step": 1600
1187
+ },
1188
+ {
1189
+ "epoch": 1.5488867376573088,
1190
+ "eval_loss": 1.3306459188461304,
1191
+ "eval_runtime": 11.0266,
1192
+ "eval_samples_per_second": 60.581,
1193
+ "eval_steps_per_second": 3.809,
1194
+ "step": 1600
1195
+ },
1196
+ {
1197
+ "epoch": 1.5585672797676668,
1198
+ "grad_norm": 1.9848611004154655,
1199
+ "learning_rate": 6.145980129568823e-07,
1200
+ "loss": 1.0002,
1201
+ "step": 1610
1202
+ },
1203
+ {
1204
+ "epoch": 1.5682478218780251,
1205
+ "grad_norm": 2.0724938891948956,
1206
+ "learning_rate": 5.890940620226479e-07,
1207
+ "loss": 1.0028,
1208
+ "step": 1620
1209
+ },
1210
+ {
1211
+ "epoch": 1.5779283639883834,
1212
+ "grad_norm": 2.123553045180447,
1213
+ "learning_rate": 5.640597182666324e-07,
1214
+ "loss": 0.9734,
1215
+ "step": 1630
1216
+ },
1217
+ {
1218
+ "epoch": 1.5876089060987415,
1219
+ "grad_norm": 1.914830200132737,
1220
+ "learning_rate": 5.395011339055886e-07,
1221
+ "loss": 0.976,
1222
+ "step": 1640
1223
+ },
1224
+ {
1225
+ "epoch": 1.5972894482090996,
1226
+ "grad_norm": 2.171534814633674,
1227
+ "learning_rate": 5.154243442378934e-07,
1228
+ "loss": 0.9662,
1229
+ "step": 1650
1230
+ },
1231
+ {
1232
+ "epoch": 1.6069699903194579,
1233
+ "grad_norm": 2.201937231455955,
1234
+ "learning_rate": 4.918352661603604e-07,
1235
+ "loss": 1.0096,
1236
+ "step": 1660
1237
+ },
1238
+ {
1239
+ "epoch": 1.6166505324298162,
1240
+ "grad_norm": 2.1624331142573316,
1241
+ "learning_rate": 4.687396967141583e-07,
1242
+ "loss": 0.9944,
1243
+ "step": 1670
1244
+ },
1245
+ {
1246
+ "epoch": 1.6263310745401742,
1247
+ "grad_norm": 2.037622724361002,
1248
+ "learning_rate": 4.4614331166018403e-07,
1249
+ "loss": 0.9976,
1250
+ "step": 1680
1251
+ },
1252
+ {
1253
+ "epoch": 1.6360116166505323,
1254
+ "grad_norm": 2.206747564852964,
1255
+ "learning_rate": 4.2405166408423154e-07,
1256
+ "loss": 0.9404,
1257
+ "step": 1690
1258
+ },
1259
+ {
1260
+ "epoch": 1.6456921587608906,
1261
+ "grad_norm": 1.8730992912712554,
1262
+ "learning_rate": 4.0247018303232437e-07,
1263
+ "loss": 0.9871,
1264
+ "step": 1700
1265
+ },
1266
+ {
1267
+ "epoch": 1.6553727008712489,
1268
+ "grad_norm": 1.7877219207538437,
1269
+ "learning_rate": 3.8140417217651437e-07,
1270
+ "loss": 0.9848,
1271
+ "step": 1710
1272
+ },
1273
+ {
1274
+ "epoch": 1.665053242981607,
1275
+ "grad_norm": 1.9978225206623983,
1276
+ "learning_rate": 3.608588085115028e-07,
1277
+ "loss": 1.0306,
1278
+ "step": 1720
1279
+ },
1280
+ {
1281
+ "epoch": 1.674733785091965,
1282
+ "grad_norm": 2.016944090640306,
1283
+ "learning_rate": 3.408391410823864e-07,
1284
+ "loss": 0.9356,
1285
+ "step": 1730
1286
+ },
1287
+ {
1288
+ "epoch": 1.6844143272023233,
1289
+ "grad_norm": 1.8659688480019923,
1290
+ "learning_rate": 3.213500897438487e-07,
1291
+ "loss": 0.9694,
1292
+ "step": 1740
1293
+ },
1294
+ {
1295
+ "epoch": 1.6940948693126816,
1296
+ "grad_norm": 1.9980865391289462,
1297
+ "learning_rate": 3.023964439511026e-07,
1298
+ "loss": 0.9816,
1299
+ "step": 1750
1300
+ },
1301
+ {
1302
+ "epoch": 1.7037754114230397,
1303
+ "grad_norm": 2.009781583476138,
1304
+ "learning_rate": 2.839828615828744e-07,
1305
+ "loss": 1.0213,
1306
+ "step": 1760
1307
+ },
1308
+ {
1309
+ "epoch": 1.7134559535333977,
1310
+ "grad_norm": 1.8630110644291102,
1311
+ "learning_rate": 2.6611386779672786e-07,
1312
+ "loss": 0.9737,
1313
+ "step": 1770
1314
+ },
1315
+ {
1316
+ "epoch": 1.723136495643756,
1317
+ "grad_norm": 2.0561453406581616,
1318
+ "learning_rate": 2.487938539169982e-07,
1319
+ "loss": 0.9567,
1320
+ "step": 1780
1321
+ },
1322
+ {
1323
+ "epoch": 1.7328170377541143,
1324
+ "grad_norm": 2.298935987903855,
1325
+ "learning_rate": 2.3202707635562371e-07,
1326
+ "loss": 0.9943,
1327
+ "step": 1790
1328
+ },
1329
+ {
1330
+ "epoch": 1.7424975798644724,
1331
+ "grad_norm": 1.9952588611867719,
1332
+ "learning_rate": 2.1581765556612233e-07,
1333
+ "loss": 0.9866,
1334
+ "step": 1800
1335
+ },
1336
+ {
1337
+ "epoch": 1.7424975798644724,
1338
+ "eval_loss": 1.3297312259674072,
1339
+ "eval_runtime": 12.1526,
1340
+ "eval_samples_per_second": 54.968,
1341
+ "eval_steps_per_second": 3.456,
1342
+ "step": 1800
1343
+ },
1344
+ {
1345
+ "epoch": 1.7521781219748305,
1346
+ "grad_norm": 2.134773106840517,
1347
+ "learning_rate": 2.001695750309926e-07,
1348
+ "loss": 0.9948,
1349
+ "step": 1810
1350
+ },
1351
+ {
1352
+ "epoch": 1.7618586640851888,
1353
+ "grad_norm": 2.144436773492719,
1354
+ "learning_rate": 1.8508668028276305e-07,
1355
+ "loss": 0.9888,
1356
+ "step": 1820
1357
+ },
1358
+ {
1359
+ "epoch": 1.771539206195547,
1360
+ "grad_norm": 1.9517157284176012,
1361
+ "learning_rate": 1.7057267795895117e-07,
1362
+ "loss": 1.0625,
1363
+ "step": 1830
1364
+ },
1365
+ {
1366
+ "epoch": 1.7812197483059051,
1367
+ "grad_norm": 2.1230795918159022,
1368
+ "learning_rate": 1.566311348911534e-07,
1369
+ "loss": 0.9834,
1370
+ "step": 1840
1371
+ },
1372
+ {
1373
+ "epoch": 1.7909002904162632,
1374
+ "grad_norm": 2.0986406154174726,
1375
+ "learning_rate": 1.4326547722848972e-07,
1376
+ "loss": 0.9507,
1377
+ "step": 1850
1378
+ },
1379
+ {
1380
+ "epoch": 1.8005808325266215,
1381
+ "grad_norm": 1.8949427782604102,
1382
+ "learning_rate": 1.3047898959562767e-07,
1383
+ "loss": 0.9997,
1384
+ "step": 1860
1385
+ },
1386
+ {
1387
+ "epoch": 1.8102613746369798,
1388
+ "grad_norm": 2.2118114869360226,
1389
+ "learning_rate": 1.1827481428557969e-07,
1390
+ "loss": 0.9919,
1391
+ "step": 1870
1392
+ },
1393
+ {
1394
+ "epoch": 1.8199419167473379,
1395
+ "grad_norm": 2.332050729229143,
1396
+ "learning_rate": 1.0665595048748257e-07,
1397
+ "loss": 0.9774,
1398
+ "step": 1880
1399
+ },
1400
+ {
1401
+ "epoch": 1.829622458857696,
1402
+ "grad_norm": 1.990555321029242,
1403
+ "learning_rate": 9.562525354954194e-08,
1404
+ "loss": 1.0053,
1405
+ "step": 1890
1406
+ },
1407
+ {
1408
+ "epoch": 1.8393030009680542,
1409
+ "grad_norm": 2.21958378472287,
1410
+ "learning_rate": 8.518543427732951e-08,
1411
+ "loss": 0.9824,
1412
+ "step": 1900
1413
+ },
1414
+ {
1415
+ "epoch": 1.8489835430784125,
1416
+ "grad_norm": 1.9806545361274472,
1417
+ "learning_rate": 7.53390582675978e-08,
1418
+ "loss": 0.9876,
1419
+ "step": 1910
1420
+ },
1421
+ {
1422
+ "epoch": 1.8586640851887706,
1423
+ "grad_norm": 2.115087225131469,
1424
+ "learning_rate": 6.608854527778319e-08,
1425
+ "loss": 0.9587,
1426
+ "step": 1920
1427
+ },
1428
+ {
1429
+ "epoch": 1.8683446272991286,
1430
+ "grad_norm": 2.167614031987702,
1431
+ "learning_rate": 5.743616863134793e-08,
1432
+ "loss": 0.9771,
1433
+ "step": 1930
1434
+ },
1435
+ {
1436
+ "epoch": 1.878025169409487,
1437
+ "grad_norm": 1.9710104273100284,
1438
+ "learning_rate": 4.938405465910706e-08,
1439
+ "loss": 0.9591,
1440
+ "step": 1940
1441
+ },
1442
+ {
1443
+ "epoch": 1.8877057115198452,
1444
+ "grad_norm": 1.9445231599191304,
1445
+ "learning_rate": 4.193418217668305e-08,
1446
+ "loss": 0.9684,
1447
+ "step": 1950
1448
+ },
1449
+ {
1450
+ "epoch": 1.8973862536302033,
1451
+ "grad_norm": 1.9461388693279436,
1452
+ "learning_rate": 3.508838199820591e-08,
1453
+ "loss": 0.9637,
1454
+ "step": 1960
1455
+ },
1456
+ {
1457
+ "epoch": 1.9070667957405614,
1458
+ "grad_norm": 1.9423016553139207,
1459
+ "learning_rate": 2.884833648639257e-08,
1460
+ "loss": 1.0192,
1461
+ "step": 1970
1462
+ },
1463
+ {
1464
+ "epoch": 1.9167473378509197,
1465
+ "grad_norm": 2.0557831910870727,
1466
+ "learning_rate": 2.3215579139101996e-08,
1467
+ "loss": 0.9587,
1468
+ "step": 1980
1469
+ },
1470
+ {
1471
+ "epoch": 1.926427879961278,
1472
+ "grad_norm": 2.102313560231879,
1473
+ "learning_rate": 1.8191494212477513e-08,
1474
+ "loss": 0.9727,
1475
+ "step": 1990
1476
+ },
1477
+ {
1478
+ "epoch": 1.936108422071636,
1479
+ "grad_norm": 2.0696636891849263,
1480
+ "learning_rate": 1.3777316380763073e-08,
1481
+ "loss": 0.9891,
1482
+ "step": 2000
1483
+ },
1484
+ {
1485
+ "epoch": 1.936108422071636,
1486
+ "eval_loss": 1.328829288482666,
1487
+ "eval_runtime": 11.0059,
1488
+ "eval_samples_per_second": 60.695,
1489
+ "eval_steps_per_second": 3.816,
1490
+ "step": 2000
1491
+ },
1492
+ {
1493
+ "epoch": 1.945788964181994,
1494
+ "grad_norm": 1.8247235042196286,
1495
+ "learning_rate": 9.9741304328832e-09,
1496
+ "loss": 0.9352,
1497
+ "step": 2010
1498
+ },
1499
+ {
1500
+ "epoch": 1.9554695062923524,
1501
+ "grad_norm": 2.289088556921837,
1502
+ "learning_rate": 6.782871005851788e-09,
1503
+ "loss": 0.9591,
1504
+ "step": 2020
1505
+ },
1506
+ {
1507
+ "epoch": 1.9651500484027107,
1508
+ "grad_norm": 2.030627946550713,
1509
+ "learning_rate": 4.2043223550869425e-09,
1510
+ "loss": 0.9201,
1511
+ "step": 2030
1512
+ },
1513
+ {
1514
+ "epoch": 1.9748305905130688,
1515
+ "grad_norm": 1.935525119332238,
1516
+ "learning_rate": 2.239118161677656e-09,
1517
+ "loss": 0.9517,
1518
+ "step": 2040
1519
+ },
1520
+ {
1521
+ "epoch": 1.9845111326234268,
1522
+ "grad_norm": 1.8394321921167789,
1523
+ "learning_rate": 8.877413766561482e-10,
1524
+ "loss": 0.9299,
1525
+ "step": 2050
1526
+ },
1527
+ {
1528
+ "epoch": 1.9941916747337851,
1529
+ "grad_norm": 2.1282656054902067,
1530
+ "learning_rate": 1.5052410231336522e-10,
1531
+ "loss": 0.9732,
1532
+ "step": 2060
1533
+ },
1534
+ {
1535
+ "epoch": 2.0,
1536
+ "step": 2066,
1537
+ "total_flos": 93553682546688.0,
1538
+ "train_loss": 0.9335812793004201,
1539
+ "train_runtime": 7476.0796,
1540
+ "train_samples_per_second": 17.685,
1541
+ "train_steps_per_second": 0.276
1542
+ }
1543
+ ],
1544
+ "logging_steps": 10,
1545
+ "max_steps": 2066,
1546
+ "num_input_tokens_seen": 0,
1547
+ "num_train_epochs": 2,
1548
+ "save_steps": 100,
1549
+ "stateful_callbacks": {
1550
+ "TrainerControl": {
1551
+ "args": {
1552
+ "should_epoch_stop": false,
1553
+ "should_evaluate": false,
1554
+ "should_log": false,
1555
+ "should_save": true,
1556
+ "should_training_stop": true
1557
+ },
1558
+ "attributes": {}
1559
+ }
1560
+ },
1561
+ "total_flos": 93553682546688.0,
1562
+ "train_batch_size": 2,
1563
+ "trial_name": null,
1564
+ "trial_params": null
1565
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5aed7aca8c570c5b079a351c19de48e2066206f25c68c55d51c258dcb784d83
3
+ size 8209
training_eval_loss.png ADDED
training_loss.png ADDED