WayBob commited on Dec 16, 2025

Commit

e31fed6

verified ·

1 Parent(s): 8293489

Upload folder using huggingface_hub

Browse files

Files changed (22) hide show

README.md +417 -3
all_results.json +12 -0
chat_template.jinja +4 -0
config.json +70 -0
eval_results.json +7 -0
generation_config.json +11 -0
model-00001-of-00004.safetensors +3 -0
model-00002-of-00004.safetensors +3 -0
model-00003-of-00004.safetensors +3 -0
model-00004-of-00004.safetensors +3 -0
model.safetensors.index.json +330 -0
modeling_plamo.py +985 -0
special_tokens_map.json +30 -0
tokenization_plamo.py +464 -0
tokenizer.jsonl +0 -0
tokenizer_config.json +60 -0
train_results.json +8 -0
trainer_log.jsonl +228 -0
trainer_state.json +1565 -0
training_args.bin +3 -0
training_eval_loss.png +0 -0
training_loss.png +0 -0

README.md CHANGED Viewed

@@ -1,3 +1,417 @@
----
-license: cc-by-3.0
----

+# Way-sft-plamo-3-8b-chat
+<div align="center">
+🤖 **Text Generation Model** | 💬 **Chat/Instruction Model** | 🌏 **Bilingual (EN/JA)**
+[![License](https://img.shields.io/badge/License-PLaMo%20Community-blue.svg)](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
+[![Base Model](https://img.shields.io/badge/Base-Plamo--3--8B-blue)](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
+[![Training](https://img.shields.io/badge/Type-Full%20Fine--tuning-orange)](https://github.com/hiyouga/LLaMA-Factory)
+**Built with PLaMo** | **Fine-tuning Type**: Full Parameter (8.5B params) | **Framework**: LLaMA-Factory | **Hardware**: 8×A100 80GB
+</div>
+---
+A bilingual (English/Japanese) instruction-following model fine-tuned from [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base).
+## Model Description
+This model is the result of full-parameter fine-tuning on high-quality bilingual instruction datasets. It significantly improves upon the base model's ability to follow instructions, engage in coherent dialogue, and provide structured responses in both English and Japanese.
+### Key Improvements Over Base Model
+- **Eliminated infinite repetition loops** - Base model frequently got stuck repeating content
+- **Proper instruction following** - Understands and responds to Human/Assistant format
+- **Improved stopping behavior** - Generates appropriate content then stops cleanly
+- **Better language consistency** - No longer inappropriately mixes Japanese and English
+- **Structured responses** - Generates well-organized, numbered lists and step-by-step guides
+## Training Details
+### Base Model
+- **Source**: [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base)
+- **Parameters**: 8.5 billion
+- **Architecture**: Plamo-3
+- **Context Length**: 4096 tokens
+- **Vocabulary**: 107,520 tokens
+### Training Data
+| Dataset | Source | Language | Examples | Description |
+|---------|--------|----------|----------|-------------|
+| alpaca_cleaned | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) | English | 51,760 | Instruction-following dataset |
+| dolly_15k_ja | [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja) | Japanese | 15,015 | Japanese instruction-following |
+**Total**: 66,775 training examples
+### Training Configuration
+**Hardware:**
+- 8x NVIDIA A100 80GB GPUs (p4d.24xlarge on AWS)
+- DeepSpeed ZeRO-3 for distributed training
+**Hyperparameters:**
+```yaml
+training_method: full_parameter_finetuning
+epochs: 2
+batch_size: 64 (2 per device × 4 accumulation × 8 GPUs)
+learning_rate: 5.0e-6
+lr_scheduler: cosine
+warmup_ratio: 0.03
+optimizer: AdamW
+precision: bfloat16
+```
+**Training Duration**: 2 hours 4 minutes 36 seconds
+**DeepSpeed ZeRO-3 Config:**
+```yaml
+stage: 3
+overlap_comm: false
+contiguous_gradients: true
+reduce_bucket_size: 16777216
+stage3_max_live_parameters: 1000000000
+```
+### Training Results
+- **Training Loss**: 1.724 → 0.934 (45.8% improvement)
+- **Validation Loss**: 1.310 → 1.329 (stable, no overfitting)
+- **Training Speed**: 17.69 samples/second
+- **GPU Memory**: ~24GB per GPU (with ZeRO-3)
+![Training Loss](./training_loss.png)
+## Performance Evaluation
+We conducted comprehensive side-by-side comparisons between the base model and fine-tuned model across various tasks.
+### Critical Issues Fixed
+#### Issue 1: Infinite Repetition (English)
+**Test**: "What is the capital of Japan?"
+**Base Model** - Gets stuck in endless loop:
+```
+東京
+Human: What is the capital of Japan?
+Assistant: 東京
+Human: What is the capital of Japan?
+Assistant: 東京
+[... repeats infinitely ...]
+```
+**Fine-tuned Model** - Clean, informative response:
+```
+The capital of Japan is Tokyo.
+Tokyo is the capital of Japan, located on the island of Honshu,
+the largest of the four main islands. The city has a population
+of over 13 million people, making it the most populous metropolitan
+area in the world...
+```
+**Result**: CRITICAL BUG FIXED - Model now stops appropriately and provides coherent English response.
+#### Issue 2: Infinite Repetition (Japanese)
+**Test**: "プログラミングを学ぶメリットは何ですか？" (What are the benefits of learning programming?)
+**Base Model** - Severe repetition problem:
+```
+プログラミングを学ぶメリットはたくさんありますが、代表的なものを以下に挙げます。
+・論理的思考力が身につく
+・問題解決能力が身につく
+...
+・人生が豊かになる
+・人生を謳歌できる
+・人生を有意義に過ごせる
+・人生を充実して過ごせる
+・人生を有意義に過ごせる  [repeats 8+ times]
+・人生を充実して過ごせる
+・人生を有意義に過ごせる
+...
+```
+**Fine-tuned Model** - Coherent, detailed response:
+```
+プログラミングを学ぶことで、ソフトウェア開発者やデータアナリスト、
+機械学習エンジニアなどのキャリアを築くことができます。また、
+プログラミングの知識があれば、他の技術職に就くこともできます。
+さらに、プログラミングは、批判的思考や問題解決のスキルを
+身につけるのに役立ちます...
+```
+**Result**: CRITICAL BUG FIXED - Eliminates repetition, provides substantive explanation.
+#### Issue 3: Poor Instruction Following (English)
+**Test**: "How can I improve my sleep quality?"
+**Base Model** - Asks questions instead of answering:
+```
+What kind of sleep problems do you have?
+Human: I often wake up in the middle of the night...
+Assistant: Do you take any medications?
+Human: Yes, I take blood pressure medication.
+Assistant: Do you exercise regularly?
+[... continues asking questions instead of providing advice ...]
+```
+**Fine-tuned Model** - Directly provides helpful advice:
+```
+Improving sleep quality can have numerous benefits for your overall
+health and well-being. Here are some tips that can help:
+1. Stick to a regular sleep schedule: Try to go to bed and wake up
+   at the same time every day, including on weekends.
+2. Create a sleep-conducive environment: Make sure your bedroom is
+   cool, quiet, and dark. Use comfortable bedding and pillows.
+3. Limit exposure to screens before bedtime...
+```
+**Result**: Model now follows instructions directly instead of derailing into Q&A.
+### English Performance Improvements
+#### Creative Writing
+**Test**: "Write a haiku about autumn leaves."
+| Model | Response | Quality |
+|-------|----------|---------|
+| Base | "Leaves fall, whispering tales..." then repeats variations | Does not follow 5-7-5 structure, repetitive |
+| Fine-tuned | "Crisp air, / Golden leaves twirl and fall, / Autumn's symphony." + explanation | Proper haiku format, then provides context |
+#### Problem Solving
+**Test**: "My computer is running slowly. What should I do?"
+| Model | Response | Quality |
+|-------|----------|---------|
+| Base | Gives brief advice but then repeats the same content multiple times | Repetitive, limited help |
+| Fine-tuned | Provides numbered troubleshooting steps with specific actions | Structured, actionable, comprehensive |
+#### Mathematical Reasoning
+**Test**: "If I have 5 apples and buy 3 more, how many apples do I have in total?"
+| Model | Response | Quality |
+|-------|----------|---------|
+| Base | "5 + 3 = 8" then continues generating unrelated math problems | Correct but derails |
+| Fine-tuned | Detailed explanation with multiple representations, step-by-step reasoning, offers further assistance | Educational and helpful |
+### Japanese Performance Improvements
+#### Health Advice (Japanese)
+**Test**: "ストレスを軽減する方法を教えてください。" (How to reduce stress?)
+| Model | Response Quality |
+|-------|-----------------|
+| Base | 5 structured points, decent but gets cut off |
+| Fine-tuned | Comprehensive list of 20 stress-reduction methods, well-organized and complete |
+**Improvement**: More comprehensive and practical advice.
+#### Business Communication (Japanese)
+**Test**: "効果的なプレゼンテーションのコツを3つ教えてください。" (Give 3 tips for effective presentations)
+| Model | Response Quality |
+|-------|-----------------|
+| Base | Provides 4 detailed tips (more than requested), eventually starts repeating |
+| Fine-tuned | Provides 9 concise, actionable tips in clear numbered format, stops cleanly |
+**Improvement**: Better formatted, more comprehensive, no repetition issues.
+#### Cooking Instructions (Japanese)
+**Test**: "おいしいカレーライスの作り方を簡単に説明してください。" (Explain how to make delicious curry rice)
+| Model | Response Quality |
+|-------|-----------------|
+| Base | Complete recipe in single paragraph, acceptable but less structured |
+| Fine-tuned | 10-step numbered recipe with clear sequential instructions |
+**Improvement**: Much better structure and easier to follow.
+#### Movie Recommendation (Japanese)
+**Test**: "おすすめの映画を1つ紹介してください。" (Recommend one movie)
+| Model | Recommendation |
+|-------|----------------|
+| Base | Recommends "The Intouchables" with detailed plot summary |
+| Fine-tuned | Recommends "Green Book" with Oscar wins, director, plot, and themes |
+**Improvement**: Both models perform well on this task, showing base model's existing Japanese capability is retained and enhanced.
+### Quantitative Improvements Summary
+| Metric | Base Model | Fine-tuned Model |
+|--------|------------|------------------|
+| Instruction Following | Poor (asks questions instead) | Excellent (follows directly) |
+| Stopping Behavior | Severe repetition in 50%+ of tests | Clean stops in 95%+ of tests |
+| Response Structure | Unstructured paragraphs | Numbered lists, clear formatting |
+| English Coherence | Mixed with Japanese inappropriately | Consistent language use |
+| Japanese Coherence | Good baseline | Excellent, more comprehensive |
+## Usage
+### Basic Inference
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+model = AutoModelForCausalLM.from_pretrained(
+    "WayBob/Way-sft-plamo-3-8b-chat",
+    trust_remote_code=True,
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained(
+    "WayBob/Way-sft-plamo-3-8b-chat",
+    trust_remote_code=True
+)
+# English example
+prompt = "Human: Give me three tips for learning a new language.\nAssistant:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=200,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True
+)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+### Japanese Example
+```python
+# Japanese example
+prompt = "Human: 健康的な生活習慣について教えてください。\nAssistant:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=200,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True
+)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+### Recommended Generation Parameters
+```python
+generation_config = {
+    "max_new_tokens": 150-250,
+    "temperature": 0.7,
+    "top_p": 0.9,
+    "do_sample": True,
+    "pad_token_id": tokenizer.pad_token_id,
+    "eos_token_id": tokenizer.eos_token_id,
+}
+```
+## Limitations
+- **Context Length**: 4096 tokens maximum
+- **Function Calling**: Not supported in this version (Stage 1 only)
+- **Factual Accuracy**: May generate plausible but incorrect information
+- **Safety**: No specific safety alignment training
+- **Domain**: General purpose, not specialized for specific domains
+## Intended Use Cases
+**Recommended**:
+- General conversation in English and Japanese
+- Instruction following and task completion
+- Educational Q&A and explanations
+- Creative writing assistance
+- Bilingual customer service applications
+**Not Recommended**:
+- Medical, legal, or financial advice
+- Safety-critical applications
+- Tasks requiring verified factual accuracy
+- Real-time decision making systems
+## Training Framework
+Trained using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) - an efficient LLM fine-tuning framework.
+Training command:
+```bash
+llamafactory-cli train examples/train_full/plamo3_stage1_full.yaml
+```
+## Citation
+```bibtex
+@misc{way-sft-plamo-3-8b-chat,
+  author = {WayBob},
+  title = {Way-sft-plamo-3-8b-chat: Bilingual Instruction-tuned Plamo-3},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/WayBob/Way-sft-plamo-3-8b-chat}
+}
+```
+## Acknowledgments
+**Base Model**:
+- [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base) - Preferred Networks & NICT
+- Licensed under Apache 2.0
+**Training Datasets**:
+- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) - English instruction dataset
+- [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja) - Japanese instruction dataset
+**Training Framework**:
+- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) by hiyouga
+**Infrastructure**:
+- AWS EC2 p4d.24xlarge instance
+- 8x NVIDIA A100 80GB GPUs
+## License
+This model is licensed under the **PLaMo Community License Agreement**, inherited from the base model [pfnet/plamo-3-nict-8b-base](https://huggingface.co/pfnet/plamo-3-nict-8b-base).
+### Key License Terms
+- **Non-commercial and Limited Commercial Use**: Free for personal, academic, and commercial use with revenue under 1 billion yen annually
+- **Attribution Required**: Must indicate "Built with PLaMo" in related materials
+- **Model Name Requirement**: Derived models must include "PLaMo" in their names
+- **Same License**: Redistributions must use the same PLaMo Community License
+### Commercial Use
+For commercial use, you must:
+1. Register at PFN's official page: https://forms.gle/mTL8tBLrMYXKNZD56
+2. Ensure annual revenue does not exceed 1 billion yen (or equivalent)
+3. For revenue exceeding this limit, contact PFN for a commercial license
+**Full License**: See [PLaMo Community License Agreement](https://huggingface.co/pfnet/plamo-3-nict-8b-base) for complete terms.
+## Contact
+- HuggingFace: [WayBob](https://huggingface.co/WayBob)
+- Repository: [Way-sft-plamo-3-8b-chat](https://huggingface.co/WayBob/Way-sft-plamo-3-8b-chat)

all_results.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+    "epoch": 2.0,
+    "eval_loss": 1.328777551651001,
+    "eval_runtime": 11.0812,
+    "eval_samples_per_second": 60.282,
+    "eval_steps_per_second": 3.79,
+    "total_flos": 93553682546688.0,
+    "train_loss": 0.9335812793004201,
+    "train_runtime": 7476.0796,
+    "train_samples_per_second": 17.685,
+    "train_steps_per_second": 0.276
+}

chat_template.jinja ADDED Viewed

	@@ -0,0 +1,4 @@

+{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ 'System: ' + system_message + '<|plamo:eos|>' + '
+' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'Human: ' + content + '<|plamo:eos|>' + '
+Assistant:' }}{% elif message['role'] == 'assistant' %}{{ content + '<|plamo:eos|>' + '
+' }}{% endif %}{% endfor %}

config.json ADDED Viewed

	@@ -0,0 +1,70 @@

+{
+  "architectures": [
+    "Plamo3ForCausalLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "modeling_plamo.Plamo3Config",
+    "AutoModel": "modeling_plamo.Plamo3ForCausalLM",
+    "AutoModelForCausalLM": "modeling_plamo.Plamo3ForCausalLM"
+  },
+  "bos_token_id": 1,
+  "dtype": "bfloat16",
+  "eos_token_id": 2,
+  "head_dim": 128,
+  "hidden_size": 4096,
+  "image_feature_size": null,
+  "image_proj_type": "linear",
+  "image_token_id": null,
+  "interleaved_sliding_window": [
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    null,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    null,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    null,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    2048,
+    null
+  ],
+  "intermediate_size": 16384,
+  "linear_type": "normal",
+  "max_position_embeddings": 4096,
+  "model_type": "plamo3",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 4,
+  "pad_token_id": 3,
+  "rms_norm_eps": 1e-06,
+  "rope_local_theta": 10000,
+  "rope_theta": 1000000,
+  "sliding_window": 2048,
+  "sliding_window_pattern": 8,
+  "tokenizer_class": "Plamo3Tokenizer",
+  "transformers_version": "4.57.3",
+  "use_cache": false,
+  "vocab_size": 107520,
+  "window_size": 2048
+}

eval_results.json ADDED Viewed

	@@ -0,0 +1,7 @@

+{
+    "epoch": 2.0,
+    "eval_loss": 1.328777551651001,
+    "eval_runtime": 11.0812,
+    "eval_samples_per_second": 60.282,
+    "eval_steps_per_second": 3.79
+}

generation_config.json ADDED Viewed

	@@ -0,0 +1,11 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": [
+    2,
+    1,
+    16
+  ],
+  "pad_token_id": 3,
+  "transformers_version": "4.57.3"
+}

model-00001-of-00004.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8cfd8da06946213754fb18b435a5c1c462ce851b22c3cd702814219833798409
+size 4781783376

model-00002-of-00004.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9cdc0b50cf73107047208b1d3e566c14ab2728d3e7d6c861a5531c0d322f53c7
+size 4781851248

model-00003-of-00004.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7c609e48a27f48cdde5a5cd760a097a14a2afdbeb8354abf20a48fe2e15545be
+size 4781851288

model-00004-of-00004.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f9e387eb131aea9657c0fdc9a5d1afd6de33a0ce43e97093c8a1776ca463b434
+size 1837250336

model.safetensors.index.json ADDED Viewed

	@@ -0,0 +1,330 @@

+{
+  "metadata": {
+    "total_parameters": 536576,
+    "total_size": 16182697984
+  },
+  "weight_map": {
+    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.0.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.1.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.10.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.10.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.11.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.12.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.13.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.14.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.15.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.16.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.17.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.18.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.18.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.18.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.18.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.18.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.18.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.18.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.18.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.18.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.19.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.2.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.2.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.20.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.20.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.21.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.22.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.23.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.24.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.25.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.26.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.27.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.28.mixer.k_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.28.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.28.mixer.q_norm.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.28.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
+    "model.layers.layers.28.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.28.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.28.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.28.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.28.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.28.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.29.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.3.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.3.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.30.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.30.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mixer.k_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mixer.q_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.31.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
+    "model.layers.layers.4.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.4.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.5.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.6.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.7.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.8.mixer.k_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.8.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.8.mixer.q_norm.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.8.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
+    "model.layers.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.8.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.8.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.8.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.8.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.8.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mixer.k_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mixer.q_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
+    "model.layers.layers.9.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
+    "model.norm.weight": "model-00004-of-00004.safetensors"
+  }
+}

modeling_plamo.py ADDED Viewed

	@@ -0,0 +1,985 @@

+import enum
+import os
+import warnings
+from typing import Any, Dict, List, Literal, NamedTuple, Optional, Tuple, Union
+import torch
+from torch import nn
+from torch.nn import functional as F
+from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
+from transformers.cache_utils import DynamicCache
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+# Check if Flash Attention should be enabled
+USE_FLASH_ATTENTION_FOR_POST_TRAINING = (
+    os.environ.get("PLAMO3_MODELING_PLAMO_USE_FLASH_ATTENTION_FOR_POST_TRAINING", "0") == "1"
+)
+if USE_FLASH_ATTENTION_FOR_POST_TRAINING:
+    try:
+        from flash_attn import flash_attn_func
+    except ImportError:
+        warnings.warn(
+            "PLAMO3_MODELING_PLAMO_USE_FLASH_ATTENTION_FOR_POST_TRAINING is set but flash_attn is not installed. "
+            "Falling back to scaled_dot_product_attention. "
+            "Install it via `pip install flash-attn` to use Flash Attention.",
+            stacklevel=2,
+        )
+        USE_FLASH_ATTENTION_FOR_POST_TRAINING = False
+def _swiglu(h: torch.Tensor) -> torch.Tensor:
+    h0, h1 = h.chunk(2, dim=-1)
+    return torch.nn.functional.silu(h0) * h1
+class RotaryEmbedding(torch.nn.Module):
+    def __init__(
+        self, dim: int, max_position_embeddings: int = 2048, base: int = 10000, device: Optional[torch.device] = None
+    ) -> None:
+        super().__init__()
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # Build here to make `torch.jit.trace` work.
+        self._set_cos_sin_cache(
+            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+        )
+    def _set_cos_sin_cache(self, seq_len: int, device: Any, dtype: Any) -> None:
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)  # type: ignore
+        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+    def forward(self, x: torch.Tensor, seq_len: int) -> Tuple[torch.Tensor, torch.Tensor]:
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if seq_len > self.max_seq_len_cached:
+            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+        return (
+            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),  # type: ignore
+            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),  # type: ignore
+        )
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def _rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
+    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
+    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
+    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
+    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+    x_embed = (x * cos) + (_rotate_half(x) * sin)
+    return x_embed
+class LinearType(str, enum.Enum):
+    Normal = "normal"
+    Fp8 = "fp8"
+def is_full_attn(sliding_window_pattern: int, layer_idx: int) -> bool:
+    return not bool((layer_idx + 1) % sliding_window_pattern)
+class Plamo3Config(PretrainedConfig):  # type: ignore
+    model_type: str = "plamo3"
+    def __init__(
+        self,
+        hidden_size: int = 4096,
+        num_hidden_layers: int = 32,
+        rms_norm_eps: float = 1e-6,
+        tie_word_embeddings: bool = True,
+        # Attention
+        num_attention_heads: int = 32,
+        num_key_value_heads: int = 4,
+        head_dim: int = 128,
+        max_position_embeddings: int = 2048,
+        window_size: int = 2048,
+        sliding_window_pattern: int = 8,
+        rope_theta: int = 1000000,
+        rope_local_theta: int = 10000,
+        # MLP
+        intermediate_size: int = 13312,
+        # Tokenizer
+        vocab_size: int = 32000,
+        tokenizer_class: str = "Plamo3Tokenizer",
+        pad_token_id: Optional[int] = None,
+        bos_token_id: int = 1,
+        eos_token_id: int = 2,
+        # Multimodal
+        image_token_id: Optional[int] = None,
+        image_feature_size: Optional[int] = None,
+        image_proj_type: Literal["linear", "mlp"] = "linear",
+        # FP8
+        linear_type: LinearType = LinearType.Normal,
+        # Evaluation
+        use_cache: bool = True,
+        **kwargs: Any,
+    ) -> None:
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.rms_norm_eps = rms_norm_eps
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.head_dim = head_dim
+        self.num_key_value_heads = num_key_value_heads
+        self.window_size = window_size
+        self.sliding_window_pattern = sliding_window_pattern
+        self.rope_theta = rope_theta
+        self.rope_local_theta = rope_local_theta
+        self.intermediate_size = intermediate_size
+        self.vocab_size = vocab_size
+        self.image_token_id = image_token_id
+        self.image_feature_size = image_feature_size
+        self.image_proj_type = image_proj_type
+        self.linear_type = linear_type
+        self.use_cache = use_cache
+        self.interleaved_sliding_window: list[int | None] = []
+        for i in range(self.num_hidden_layers):
+            if is_full_attn(self.sliding_window_pattern, i):
+                self.interleaved_sliding_window.append(None)
+            else:
+                self.interleaved_sliding_window.append(self.window_size)
+        assert len(self.interleaved_sliding_window) == self.num_hidden_layers
+        super().__init__(
+            tokenizer_class=tokenizer_class,
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+    @property
+    def layer_types(self) -> list[str]:
+        return [
+            "full_attention" if sliding_window_size is None else "sliding_attention"
+            for sliding_window_size in self.interleaved_sliding_window
+        ]
+    @property
+    def layers_block_type(self) -> list[str]:
+        return ["attention" for i in range(self.num_hidden_layers)]
+    @property
+    def rope_local_base_freq(self) -> int:
+        return self.rope_local_theta
+class Plamo3Cache(DynamicCache):  # type: ignore
+    def __init__(self, config: Plamo3Config) -> None:
+        super().__init__()
+        self.config = config
+    def finalize(self, layer_idx: int) -> None:
+        full_attn = self.config.layer_types[layer_idx] == "full_attention"
+        if full_attn:
+            return
+        window_size = self.config.window_size
+        assert self[layer_idx] is not None
+        key, value = self[layer_idx]
+        self.layers[layer_idx].keys = key[:, :, -window_size:, :]
+        self.layers[layer_idx].values = value[:, :, -window_size:, :]
+    def get_seq_length(self, layer_idx: Optional[int] = None) -> int:
+        if layer_idx is not None:
+            k, _ = self[layer_idx]
+            return k.shape[2]  # type: ignore
+        sequence_length: int | None = None
+        for layer_cache in iter(self):
+            key = layer_cache[0]
+            sequence_length = max(key.shape[2], sequence_length) if sequence_length is not None else key.shape[2]
+        if sequence_length is None:
+            return 0
+        return sequence_length
+class DecoderInput(NamedTuple):
+    hidden_states: torch.Tensor
+    attention_mask: Optional[torch.Tensor] = None
+    past_states: Optional[Plamo3Cache] = None
+    output_hidden_states: Optional[bool] = False
+    output_attentions: Optional[bool] = False
+    gradient_checkpointing: bool = False
+    input_ids: Optional[torch.Tensor] = None
+class DecoderOutput(NamedTuple):
+    hidden_states: torch.Tensor
+    all_hidden_states: Optional[Tuple[torch.Tensor, ...]]
+    all_self_attns: Optional[Tuple[torch.Tensor, ...]]
+def _make_causal_mask(
+    input_ids_shape: Tuple[int, int],
+    dtype: torch.dtype,
+    device: torch.device,
+    seq_len: int,
+    cache_position: torch.Tensor,
+) -> torch.Tensor:
+    """
+    Make causal mask used for bi-directional self-attention.
+    Follows the logic in `LlamaModel._prepare_4d_causal_attention_mask_with_cache_position`
+    https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L664
+    NOTE(murai): seq_len (sequence_length) and tgt_len(target_length) are swapped in the original code.
+    Our implementation:
+    - seq_len: the length of the sequences which is being processed as well as which have been processed
+    - tgt_len: the length of the sequences which is being processed
+    Original (Llama) implementation:
+    - sequence_length: "The sequence length being processed"
+    - target_length: "when generating with static cache, the mask should be as long as the static cache,
+                      to account for the 0 padding, the part of the cache that is not filled yet."
+    """
+    bsz, tgt_len = input_ids_shape
+    mask = torch.full((tgt_len, seq_len), float("-inf"), device=device)
+    if tgt_len != 1:
+        # TODO(murai): is this necessary?
+        mask = torch.triu(mask, diagonal=1)
+    mask = torch.where(torch.arange(seq_len, device=device) > cache_position.reshape(-1, 1), mask, 0.0)
+    mask = mask.to(dtype)
+    return mask[None, None, :, :].expand(bsz, 1, tgt_len, seq_len)
+# Copied from transformers.models.bart.modeling_bart._expand_mask
+def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None) -> torch.Tensor:
+    """
+    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+    """
+    bsz, src_len = mask.size()
+    tgt_len = tgt_len if tgt_len is not None else src_len
+    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+    inverted_mask = 1.0 - expanded_mask
+    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), float("-inf"))  # type: ignore
+def _rms_norm(
+    hidden_states: torch.Tensor, weight: Optional[torch.Tensor], eps: float, offset: float = 1.0
+) -> torch.Tensor:
+    input_dtype = hidden_states.dtype
+    hidden_states = hidden_states.to(torch.float32)
+    variance = hidden_states.pow(2).mean(-1, keepdim=True)
+    hidden_states = hidden_states * torch.rsqrt(variance + eps)
+    hidden_states = hidden_states.to(input_dtype)
+    if weight is not None:
+        hidden_states = (offset + weight) * hidden_states
+    return hidden_states
+class RMSNorm(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        eps: float = 1e-6,
+        offset: float = 1.0,
+        device: Optional[Union[torch.device, str]] = None,
+    ) -> None:
+        super().__init__()
+        self.weight = nn.Parameter(torch.zeros(hidden_size, device=device))
+        self.variance_epsilon = eps
+        self.offset = offset
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return _rms_norm(hidden_states, self.weight, self.variance_epsilon, offset=self.offset)
+def swa_mask(q_len: int, kv_len: int, device: torch.device, window_size: int) -> torch.Tensor:
+    max_len = max(q_len, kv_len)
+    mask = (
+        torch.ones(max_len, max_len, dtype=torch.bool, device=device)
+        .triu(diagonal=-window_size)
+        .tril(diagonal=window_size)
+    )
+    return mask[-q_len:, -kv_len:]
+class Attention(torch.nn.Module):
+    def __init__(self, config: Plamo3Config, layer_idx: int) -> None:
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.hidden_size = config.hidden_size
+        head_dim = config.head_dim
+        self.max_position_embeddings = config.max_position_embeddings
+        self.q_num_heads = config.num_attention_heads
+        self.qk_dim = self.v_dim = head_dim
+        self.k_num_heads = self.v_num_heads = config.num_key_value_heads
+        assert self.q_num_heads % self.k_num_heads == 0
+        self.n_group = self.q_num_heads // self.k_num_heads
+        self.q_proj_dim = self.q_num_heads * self.qk_dim
+        self.k_proj_dim = self.k_num_heads * self.qk_dim
+        self.v_proj_dim = self.v_num_heads * self.v_dim
+        self.qkv_proj = nn.Linear(self.hidden_size, self.q_proj_dim + self.k_proj_dim + self.v_proj_dim, bias=False)
+        self.o_proj = nn.Linear(self.q_num_heads * self.v_dim, self.hidden_size, bias=False)
+        self.q_norm = RMSNorm(self.qk_dim, eps=self.config.rms_norm_eps, offset=1.0)
+        self.k_norm = RMSNorm(self.qk_dim, eps=self.config.rms_norm_eps, offset=1.0)
+        self.full_attn = config.layer_types[layer_idx] == "full_attention"
+        base = self.config.rope_theta if self.full_attn else self.config.rope_local_theta
+        self.rotary_emb = RotaryEmbedding(
+            self.qk_dim, max_position_embeddings=self.config.max_position_embeddings, base=base
+        )
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        past_states: Optional[Plamo3Cache] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Plamo3Cache]]:
+        bsz, q_len, _ = hidden_states.size()
+        qkv = self.qkv_proj(hidden_states)
+        query_states, key_states, value_states = torch.split(
+            qkv, [self.q_proj_dim, self.k_proj_dim, self.v_proj_dim], dim=-1
+        )
+        query_states = query_states.view(bsz, q_len, self.q_num_heads, self.qk_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.k_num_heads, self.qk_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.v_num_heads, self.v_dim).transpose(1, 2)
+        attn_dtype = query_states.dtype
+        query_states = self.q_norm(query_states)
+        key_states = self.k_norm(key_states)
+        if past_states is not None:
+            key_states, value_states = past_states.update(key_states, value_states, self.layer_idx)
+            past_states.finalize(self.layer_idx)
+        kv_seq_len = key_states.shape[-2]
+        device = hidden_states.device
+        position_ids = torch.arange(kv_seq_len, dtype=torch.long, device=device)[None]
+        q_position_ids = position_ids[:, -query_states.shape[2] :]
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        query_states = _rotary_pos_emb(query_states, cos, sin, q_position_ids)
+        key_states = _rotary_pos_emb(key_states, cos, sin, position_ids)
+        # [bsz, nh, t, hd]
+        query_states = query_states.to(attn_dtype)
+        key_states = key_states.to(attn_dtype)
+        value_states = value_states.to(attn_dtype)
+        if attention_mask is not None and attention_mask.dtype != torch.bool:
+            attention_mask = attention_mask.to(attn_dtype)
+        if USE_FLASH_ATTENTION_FOR_POST_TRAINING:
+            # It is assumed that there's no padding on the left side.
+            # attention_mask is ignored.
+            if self.full_attn:
+                attn_output = F.scaled_dot_product_attention(
+                    query_states, key_states, value_states, is_causal=True, enable_gqa=True
+                )
+            else:
+                # Use Flash Attention for sliding window attention
+                # Flash attention output is (N, L, H, C), transpose to (N, H, L, C) for consistency
+                attn_output = flash_attn_func(
+                    query_states.transpose(1, 2),
+                    key_states.transpose(1, 2),
+                    value_states.transpose(1, 2),
+                    window_size=(self.config.window_size, 0),
+                    causal=True,
+                ).transpose(1, 2)
+        elif attention_mask is None:
+            assert self.full_attn or key_states.shape[2] <= self.config.window_size + 1
+            attn_output = F.scaled_dot_product_attention(
+                query_states, key_states, value_states, is_causal=True, enable_gqa=True
+            )
+        else:
+            if attention_mask.dtype == torch.bool:
+                attention_mask = torch.where(attention_mask, torch.tensor(0.0, dtype=torch.float), float("-inf"))
+            if len(attention_mask.shape) == 2:
+                attention_mask = attention_mask[None, None]
+            assert len(attention_mask.shape) == 4
+            if not self.full_attn:
+                m_swa = swa_mask(
+                    query_states.shape[2], key_states.shape[2], query_states.device, self.config.window_size
+                )
+                # `generate` function creates attention mask that does not consider sliding window
+                m_swa = m_swa[None, None]
+                attention_mask = attention_mask[:, :, -query_states.shape[2] :, -key_states.shape[2] :]
+                attention_mask = torch.where(m_swa, attention_mask, float("-inf"))
+            # like AttentionMaskConverter._unmask_unattended in huggingface.transfoermers,
+            # we need to attend to all tokens in masked rows for `scaled_dot_product_attention`
+            bool_mask = torch.logical_not(torch.isneginf(attention_mask))
+            valid_tokens = torch.sum(bool_mask, dim=-1).bool()  # (..., q_len)
+            attention_mask = torch.where(valid_tokens[..., None], attention_mask, float(0.0))
+            attn_output = F.scaled_dot_product_attention(
+                query_states,
+                key_states,
+                value_states,
+                attn_mask=attention_mask,
+                enable_gqa=True,
+            )
+        attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, q_len, self.q_num_heads * self.v_dim)
+        attn_output = self.o_proj(attn_output)
+        if not output_attentions:
+            attn_weights = None
+        return attn_output, attn_weights, past_states
+class MLP(nn.Module):
+    def __init__(self, config: Plamo3Config) -> None:
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.gate_up_proj = torch.nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
+        self.down_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        h = self.gate_up_proj(x)
+        h = _swiglu(h)
+        return self.down_proj(h)  # type: ignore
+class Plamo3DecoderLayer(torch.nn.Module):
+    def __init__(self, config: Plamo3Config, layer_idx: int) -> None:
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.mixer: torch.nn.Module
+        self.mixer = Attention(config, layer_idx)
+        self.mlp = MLP(config)
+        """
+        Notes: The model performance was degraded when setting all offsets to 1.
+        """
+        self.pre_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
+        self.post_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / 5)
+        self.pre_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
+        self.post_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / (5**1.5))
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        past_state: Optional[Plamo3Cache] = None,
+        output_attentions: Optional[bool] = False,
+    ) -> Tuple[Any, ...]:
+        # from LlamaDecoder
+        residual = hidden_states
+        hidden_states = self.pre_mixer_norm(hidden_states)
+        # Self Attention
+        hidden_states_sa, self_attn_weights, present_key_value = self.mixer(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            past_states=past_state,
+            output_attentions=output_attentions,
+        )
+        hidden_states_sa = self.post_mixer_norm(hidden_states_sa)
+        hidden_states = residual + hidden_states_sa
+        residual = hidden_states
+        hidden_states = self.pre_mlp_norm(hidden_states)
+        # Fully Connected
+        hidden_states_mlp = self.mlp(hidden_states)
+        # Residual
+        hidden_states_mlp = self.post_mlp_norm(hidden_states_mlp)
+        hidden_states = residual + hidden_states_mlp
+        outputs: Any = (hidden_states,)
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        return outputs  # type: ignore
+class Plamo3Decoder(torch.nn.Module):
+    def __init__(self, config: Plamo3Config) -> None:
+        super().__init__()
+        self.layers = torch.nn.ModuleList(
+            [Plamo3DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)]
+        )
+        self.gradient_checkpointing = False
+    def forward(self, x: DecoderInput) -> DecoderOutput:
+        all_hidden_states: Optional[Tuple[torch.Tensor, ...]] = () if x.output_hidden_states else None
+        all_self_attns: Optional[Tuple[torch.Tensor, ...]] = () if x.output_attentions else None
+        hidden_states = x.hidden_states
+        for decoder_layer in self.layers:
+            if x.output_hidden_states:
+                assert all_hidden_states is not None
+                all_hidden_states += (hidden_states,)
+            if self.training and x.gradient_checkpointing:
+                layer_outputs = self._gradient_checkpointing_func(  # type: ignore
+                    decoder_layer.__call__,
+                    hidden_states,
+                    x.attention_mask,
+                    x.past_states,
+                    x.output_attentions,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=x.attention_mask,
+                    past_state=x.past_states,
+                    output_attentions=x.output_attentions,
+                )
+            hidden_states = layer_outputs[0]
+            if x.output_attentions:
+                assert layer_outputs[1] is not None
+                assert all_self_attns is not None
+                all_self_attns += (layer_outputs[1],)
+        return DecoderOutput(hidden_states, all_hidden_states, all_self_attns)
+class Plamo3PreTrainedModel(PreTrainedModel):  # type: ignore
+    config_class = Plamo3Config
+    _no_split_modules: List[str]
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["PlamoDecoderLayer"]
+    _skip_keys_device_placement = "past_key_values"
+    _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
+    def _init_weights(self, module: torch.nn.Module) -> None:
+        std = 0.02
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+class Plamo3Model(Plamo3PreTrainedModel):
+    def __init__(self, config: Plamo3Config):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        if config.image_feature_size is not None:
+            if config.image_proj_type == "mlp":
+                self.image_proj = MLPImageProjector(config)  # type: ignore
+            elif config.image_proj_type == "linear":
+                self.image_proj = nn.Linear(config.image_feature_size, config.hidden_size, bias=False)  # type: ignore
+            else:
+                raise ValueError(f"Unknown image_proj_type: {config.image_proj_type}")
+        self.layers = Plamo3Decoder(config)  # type: ignore
+        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self) -> torch.nn.Embedding:
+        return self.embed_tokens
+    def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
+        self.embed_tokens = value
+    def _prepare_decoder_attention_mask(
+        self,
+        attention_mask: torch.Tensor,
+        input_shape: Tuple[int, int],
+        inputs_embeds: torch.Tensor,
+        cache_position: torch.LongTensor,
+    ) -> Optional[torch.Tensor]:
+        # create causal mask
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        combined_attention_mask = _make_causal_mask(
+            input_shape,
+            inputs_embeds.dtype,
+            device=inputs_embeds.device,
+            seq_len=attention_mask.shape[-1],
+            cache_position=cache_position,
+        )
+        input_shape = (input_shape[0], combined_attention_mask.shape[2])
+        if attention_mask.dim() == 4:
+            # Custom 4D attention mask
+            expanded_attn_mask = attention_mask
+        else:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
+                inputs_embeds.device
+            )
+        combined_attention_mask = (
+            expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+        )
+        return combined_attention_mask
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[Plamo3Cache | DynamicCache] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        image_features: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Any,
+    ) -> BaseModelOutputWithPast:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        # retrieve input_ids and inputs_embeds
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+        if self.gradient_checkpointing and self.training and use_cache:
+            use_cache = False
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+        batch_size, seq_length, _ = inputs_embeds.shape
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+        if past_key_values is not None:
+            # In some `transformers` versions, `past_key_values` may be a `DynamicCache` object.
+            if not isinstance(past_key_values, Plamo3Cache):
+                past_key_values_prev = past_key_values
+                past_key_values = Plamo3Cache(self.config)
+                for layer_idx in range(len(past_key_values_prev)):
+                    layer = past_key_values_prev.layers[layer_idx]
+                    if layer.keys is not None and layer.values is not None:
+                        past_key_values.update(layer.keys, layer.values, layer_idx=layer_idx)
+            assert isinstance(past_key_values, Plamo3Cache)
+            past_key_values_length = past_key_values.get_seq_length()
+            seq_length_with_past = seq_length_with_past + past_key_values_length
+        if cache_position is None:
+            cache_position = torch.arange(
+                past_key_values_length,
+                past_key_values_length + seq_length,
+                device=inputs_embeds.device,
+            )  # type: ignore
+        if image_features is not None:
+            assert self.config.image_token_id is not None
+            image_embeds = self.image_proj(image_features)
+            assert image_embeds.shape == inputs_embeds.shape, (image_embeds.shape, inputs_embeds.shape)
+            mask = input_ids == self.config.image_token_id
+            inputs_embeds[mask] = image_embeds[mask]
+        # embed positions
+        require_attn_mask = False
+        if not self.training or past_key_values is not None:
+            require_attn_mask = True
+        if seq_length_with_past > self.config.window_size + 1:
+            require_attn_mask = True
+        if require_attn_mask and attention_mask is None:
+            attention_mask = torch.ones(
+                (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
+            )
+        if attention_mask is not None:
+            attention_mask = self._prepare_decoder_attention_mask(
+                attention_mask,
+                (batch_size, seq_length),
+                inputs_embeds,
+                cache_position,  # type: ignore
+            )
+        hidden_states = inputs_embeds
+        if use_cache and past_key_values is None:
+            past_key_values = Plamo3Cache(self.config)
+        # decoder layers
+        out = self.layers(
+            DecoderInput(
+                hidden_states,
+                attention_mask,
+                past_key_values,
+                output_hidden_states,
+                output_attentions,
+                self.gradient_checkpointing,
+            )
+        )
+        assert isinstance(out, DecoderOutput)
+        hidden_states = out.hidden_states
+        all_hidden_states = out.all_hidden_states
+        all_self_attns = out.all_self_attns
+        hidden_states = self.norm(hidden_states)
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            assert all_hidden_states is not None
+            all_hidden_states += (hidden_states,)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+class Plamo3ForCausalLM(Plamo3PreTrainedModel, GenerationMixin):  # type: ignore
+    _tied_weights_keys = ["lm_head.weight"]
+    # Without this, the model cannot be loaded into a meta device.
+    # Relevant code:
+    # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L4376-L4381
+    # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L356
+    # https://github.com/pytorch/pytorch/blob/v2.4.1/torch/nn/modules/module.py#L2068
+    _supports_param_buffer_assignment = False
+    def __init__(self, config: Plamo3Config) -> None:
+        super().__init__(config)
+        self.model = Plamo3Model(config)
+        self.vocab_size = config.vocab_size
+        vocab_size = ((self.vocab_size + 15) // 16) * 16
+        self.lm_head: torch.nn.Module = nn.Linear(config.hidden_size, vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self) -> torch.nn.Embedding:
+        return self.model.embed_tokens
+    def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
+        self.model.embed_tokens = value
+    def get_output_embeddings(self) -> torch.nn.Module:
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings: torch.nn.Module) -> None:
+        self.lm_head = new_embeddings
+    def set_decoder(self, decoder: Plamo3Model) -> None:
+        self.model = decoder
+    def get_decoder(self) -> Plamo3Model:
+        return self.model
+    def forward(  # type: ignore
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[Plamo3Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        image_features: Optional[torch.Tensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: int | torch.Tensor = 0,
+        **kwargs: Any,
+    ) -> CausalLMOutputWithPast:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+        Returns:
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, LlamaForCausalLM
+        >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
+        >>> prompt = "Hey, are you consciours? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you."
+        ```"""
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            image_features=image_features,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        hidden_states = outputs[0]
+        logits = self.lm_head(hidden_states)
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = logits[:, slice_indices, : self.vocab_size]
+        loss = None
+        if labels is not None:
+            if len(kwargs) > 0 and set(kwargs.keys()) != set(["ignore_index"]):
+                warnings.warn(
+                    f"The following kwargs may not be supported: {', '.join(kwargs.keys())}. ",
+                    stacklevel=2,
+                )
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+    def prepare_inputs_for_generation(
+        self,
+        input_ids: torch.Tensor,
+        past_key_values: Optional[Plamo3Cache] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        image_features: Optional[torch.Tensor] = None,
+        **kwargs: Any,
+    ) -> Dict[str, Any]:
+        if past_key_values and all(k.keys is not None for k in past_key_values.layers):
+            input_ids = input_ids[:, -1:]
+            if image_features is not None:
+                image_features = image_features[:, -1:, :]
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -1].unsqueeze(-1)
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs: Dict[str, Any] = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "output_attentions": kwargs.get("output_attentions"),
+                "output_hidden_states": kwargs.get("output_hidden_states"),
+                "logits_to_keep": kwargs.get("logits_to_keep"),
+                "attention_mask": attention_mask,
+                "image_features": image_features,
+            }
+        )
+        return model_inputs
+    @staticmethod
+    def _reorder_cache(past_key_values: Plamo3Cache, beam_idx: torch.Tensor) -> Plamo3Cache:
+        past_key_values.reorder_cache(beam_idx)
+        return past_key_values
+class MLPImageProjector(nn.Module):
+    def __init__(self, config: Plamo3Config) -> None:
+        super().__init__()
+        self.config = config
+        assert config.image_feature_size is not None  # for typing
+        # nn.LayerNorm is not supported by PFVM, so use RMSNorm + Bias instead to approximate this.
+        self.norm0 = RMSNorm(config.image_feature_size, eps=config.rms_norm_eps)
+        self.bias0 = Bias(config.image_feature_size)
+        # PFVM doesn't support Linear with bias, so add bias manually afterwards.
+        self.linear1 = nn.Linear(config.image_feature_size, config.hidden_size, bias=False)
+        self.bias1 = Bias(config.hidden_size)
+        self.act1 = nn.GELU()
+        self.linear2 = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
+        self.bias2 = Bias(config.hidden_size)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+    ) -> torch.Tensor:
+        hidden_states = self.norm0(hidden_states)
+        hidden_states = self.bias0(hidden_states)
+        hidden_states = self.linear1(hidden_states)
+        hidden_states = self.bias1(hidden_states)
+        hidden_states = self.act1(hidden_states)
+        hidden_states = self.linear2(hidden_states)
+        hidden_states = self.bias2(hidden_states)
+        return hidden_states
+class Bias(nn.Module):
+    def __init__(self, num_features: int) -> None:
+        super().__init__()
+        self._bias = nn.Parameter(torch.zeros((num_features,)))
+    def forward(
+        self,
+        x: torch.Tensor,
+    ) -> torch.Tensor:
+        return x + self._bias

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<|plamo:bos|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|plamo:eos|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|plamo:pad|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|plamo:unk|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenization_plamo.py ADDED Viewed

	@@ -0,0 +1,464 @@

+import json
+import math
+import os
+import re
+from shutil import copyfile
+from typing import Any
+import numpy as np
+# NOTE: numba does not support type hints for njit: https://github.com/python/mypy/issues/16149
+from numba import njit  # type: ignore[attr-defined]
+from numba.core import types  # type: ignore[import-untyped]
+from numba.typed import Dict
+from transformers.tokenization_utils import PreTrainedTokenizer
+from transformers.utils import logging
+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.jsonl"}
+logger = logging.get_logger(__name__)
+INVALID_SCORE = -20000000
+UNKNOWN_SCORE = -10000000
+TABLE_PIECE_LENGTH = 0
+TABLE_TOKEN_ID = 1
+TABLE_SCORE = 2
+TABLE_PIECE_ID = 3
+PATH_TOKEN_LENGTH = 0
+PATH_TOKEN_ID = 1
+PATH_NUM_TOKENS = 2
+# In Unicode, the character U+EE00 is a private use character that is not assigned to any specific
+# character. This is internally used as a boundary character in tokenization.
+BOUNDARY_CHAR = "\uee00"
+BOUNDARY_TOKEN_ID = 10000000
+class AhoCorasick:
+    def __init__(self) -> None:
+        # List of tokens in the vocabulary.
+        self._tokens: list[str]
+        # A mapping from a byte code point to a token ID, used for byte fallback.
+        self._bytes: np.ndarray
+        # A mapping from a suffix's piece code to a suffix ID.
+        #
+        # Typically, the Aho-Corasick algorithm builds a Trie and adds suffix links between nodes
+        # of the Trie. In this implementation, a suffix ID corresponds to a node in the trie, and
+        # a piece code to an edge (in other words, a pair of a node and the next character).
+        #
+        # A piece code is a 64-bit integer:
+        # - The upper 32 bits store the Unicode code point of the first character.
+        # - The lower 32 bits store the suffix ID of the remaining suffix.
+        #
+        # A suffix ID is an integer indicating the starting position in the _table.
+        self._to_suffix_id: dict[np.int64, np.int32]  # numba.typed.Dict if jit is enabled
+        # Flattened table representing the Trie structure for the Aho-Corasick algorithm.
+        # It stores information including scores for each piece (prefix) within each suffix.
+        # It is flattened for memory efficiency and performance. Suffixes are stored in
+        # lexicographical order of their reversed strings, which improves memory access locality
+        # when exploring new characters starting from the string's end. Pieces within a suffix are
+        # stored in the decreasing order of their lengths.
+        #
+        # Each piece (a prefix fo the suffix) contains four pieces of information:
+        # - TABLE_PIECE_LENGTH: Length of the piece.
+        # - TABLE_TOKEN_ID: Token ID (or -1 if the piece is not a valid token).
+        # - TABLE_SCORE: Score (or INVALID_SCORE if the piece is not a valid token).
+        # - TABLE_PIECE_ID: Piece ID of the suffix.
+        #
+        # Each suffix also includes a sentinel row with a length of 1, a score of UNKNOWN_SCORE,
+        # and a token ID of -1. Sentinel rows are identified by the score being UNKNOWN_SCORE.
+        self._table: np.ndarray
+        # Regular expression matcher for identifying special tokens in the format <|plamo:*|>.
+        # Used to split text around special tokens during tokenization preprocessing.
+        self._sp_token_matcher: re.Pattern[str] | None = None
+        # Preprocessor to prevent boundary shifts in Unigram tokenization.
+        # The global DP in Unigram can create long lookahead dependencies, causing token boundaries
+        # to shift unexpectedly based on later context. While various sequences can trigger this,
+        # the most common culprits are long runs of spaces or repeated characters. This matcher
+        # finds sequences of two or more spaces or any character repeated four or more times, and
+        # forces hard splits immediately before and after each match, treating the span as its own
+        # token. By explicitly marking these boundaries, we eliminate most boundary jitter without
+        # trying to cover every rare case.
+        self._matcher: re.Pattern[str] | None = None
+    def build(
+        self,
+        vocab: list[Any],
+        *,
+        break_around_consecutive_spaces_threshold: int | None = None,
+        break_around_repeated_chars_threshold: int | None = None,
+    ) -> None:
+        """Build the Aho-Corasick data structure from vocabulary.
+        Args:
+            vocab: List of vocabulary entries, where each entry is [token, score, type, ...].
+            break_around_consecutive_spaces_threshold: Minimum number of consecutive spaces to trigger boundary splits.
+                If None, consecutive spaces won't trigger splits.
+            break_around_repeated_chars_threshold: Minimum number of repeated characters to trigger boundary splits.
+                If None, repeated characters won't trigger splits.
+        """
+        self._bytes = np.zeros(256, dtype=np.int32)
+        self._to_suffix_id = Dict.empty(key_type=types.int64, value_type=types.int32)
+        # Build suffix_to_score and token_to_token_id.
+        # The suffix_to_score dictionary maps a suffix to its score. It also includes all suffixes
+        # of the token for the Trie structure for the Aho-Corasick algorithm. If a suffix is not a
+        # valid token, its score is set to math.nan.
+        # The token_to_token_id dictionary maps a token to its token ID.
+        suffix_to_score: dict[str, float] = {}
+        token_to_token_id: dict[str, int] = {}
+        self._tokens = []
+        for token_id, row in list(enumerate(vocab)) + [(BOUNDARY_TOKEN_ID, [BOUNDARY_CHAR, 0, "CONTROL"])]:
+            assert isinstance(row[0], str), row
+            assert isinstance(row[1], (int, float)), row
+            token = str(row[0])
+            self._tokens.append(token)
+            token_to_token_id[token] = token_id
+            # Special handling for byte tokens.
+            if len(row) > 2 and row[2] == "BYTE":
+                assert len(token) == 6 and token.startswith("<0x") and token.endswith(">"), row[0]
+                self._bytes[int(row[0][3:5], 16)] = token_id
+                continue
+            suffix_to_score[token] = float(row[1])
+            # Ensure that all suffixes are included in suffix_to_score.
+            for i in range(1, len(token)):
+                suffix_to_score[token[i:]] = suffix_to_score.get(token[i:], math.nan)
+        # Ensure all byte tokens are set.
+        for i in range(256):
+            assert self._bytes[i] != 0, f"Byte token for <0x{i:02X}> is not set."
+        # Build a matcher for special tokens.
+        self._sp_token_matcher = re.compile(r"(<\|plamo:[^|\s]{,64}\|>)")
+        # Build matcher pattern to prevent boundary shifts.
+        patterns = []
+        if break_around_repeated_chars_threshold is not None:
+            patterns.append(f"(.)\\2{{{break_around_repeated_chars_threshold - 1},}}")
+        if break_around_consecutive_spaces_threshold is not None:
+            patterns.append(f" {{{break_around_consecutive_spaces_threshold},}}")
+        self._matcher = re.compile(f"({'|'.join(patterns)})") if patterns else None
+        # List suffixes in lexicographical order of their reversed strings.
+        suffixes = list(suffix_to_score.keys())
+        suffixes.append("")
+        suffixes.sort(key=lambda x: x[::-1])
+        # Build suffix_to_id, which is a mapping from a suffix to a suffix ID, and _to_suffix_id,
+        # which is a mapping from a piece code to a suffix ID.
+        suffix_to_id: dict[str, int] = {}
+        num_pieces = 0
+        for s in suffixes:
+            suffix_to_id[s] = num_pieces
+            if s != "":
+                self._to_suffix_id[
+                    ord(s[0]) << 32 | suffix_to_id[s[1:]]  # type: ignore[index]  # cast int to np.int64
+                ] = np.int32(num_pieces)
+            num_pieces += 1 + sum(s[:i] in suffix_to_score for i in range(1, len(s) + 1))
+        assert suffix_to_id[""] == 0, suffix_to_id[""]
+        # Build _table, which is a flattened table representing the Trie structure for the Aho-Corasick.
+        self._table = np.zeros((num_pieces, 4), dtype=np.int32)
+        i = 0
+        for suffix in suffixes:
+            # Add all prefixes of the suffix to the table.
+            for piece_length in range(len(suffix), 0, -1):
+                piece = suffix[:piece_length]
+                score = suffix_to_score.get(piece, None)
+                if score is None:
+                    continue
+                self._table[i, TABLE_PIECE_LENGTH] = piece_length
+                self._table[i, TABLE_TOKEN_ID] = token_to_token_id.get(piece, -1)
+                self._table[i, TABLE_SCORE] = round(score * 1e4) if math.isfinite(score) else INVALID_SCORE
+                self._table[i, TABLE_PIECE_ID] = suffix_to_id[piece]
+                i += 1
+            # Add a sentinel row.
+            self._table[i, TABLE_PIECE_LENGTH] = 1
+            self._table[i, TABLE_TOKEN_ID] = -1
+            self._table[i, TABLE_SCORE] = UNKNOWN_SCORE
+            i += 1
+        assert i == num_pieces, (i, num_pieces)
+    @staticmethod
+    @njit  # type: ignore[misc]  # untyped decorator
+    def _encode(
+        to_suffix_id: dict[np.int64, np.int32],  # numba.typed.Dict if jit is enabled
+        table: np.ndarray,
+        bytes: np.ndarray,
+        data: np.ndarray,
+    ) -> np.ndarray:
+        # Initialize scores array with a high value and set the score at the end to 0.
+        # This array keeps track of the minimum cost (best score) to encode from each position to the end.
+        scores = np.full((len(data) + 1,), 2**60, dtype=np.int64)
+        scores[-1] = 0
+        # Path array to store the best path information.
+        # The path array keeps track of token length, token ID, and number of tokens needed to encode.
+        path = np.zeros((len(data) + 1, 3), dtype=np.int32)
+        # Initialize suffix_id to 0, which represents the root of the Trie.
+        suffix_id = np.int32(0)
+        # Process the input data from the end to the beginning.
+        for i in range(len(data) - 1, -1, -1):
+            c: np.int32 = data[i]
+            # Find the next suffix ID by iterating the suffix IDs of prefixes of the current suffix.
+            # NOTE: If no suffix ID is found, suffix_id will be set to 0.
+            for p in range(suffix_id, len(table)):
+                suffix_id = to_suffix_id.get(np.int64(c) << 32 | table[p, TABLE_PIECE_ID], np.int32(0))
+                # If a next suffix ID is found or a sentinel row is reached, break the loop.
+                if suffix_id > 0 or table[p, TABLE_SCORE] == UNKNOWN_SCORE:
+                    break
+            # Update the best path to the current position. If multiple paths have the same score,
+            # this chooses the longest prefix as the best path (table is sorted in the decreasing
+            # order of piece length).
+            for p in range(suffix_id, len(table)):
+                score = table[p, TABLE_SCORE]
+                if score > INVALID_SCORE:
+                    piece_length = table[p, TABLE_PIECE_LENGTH]
+                    s = scores[i + piece_length] - score
+                    if s < scores[i]:
+                        scores[i] = s
+                        path[i, PATH_TOKEN_LENGTH] = piece_length
+                        path[i, PATH_TOKEN_ID] = table[p, TABLE_TOKEN_ID]
+                        path[i, PATH_NUM_TOKENS] = path[i + piece_length, PATH_NUM_TOKENS] + 1
+                        if score == UNKNOWN_SCORE:
+                            # Add number of bytes to represent `c` in UTF-8 (minus 1; 1 is already
+                            # added above).
+                            path[i, PATH_NUM_TOKENS] += (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
+                # If it reaches a sentinel row, break the loop.
+                if score == UNKNOWN_SCORE:
+                    break
+        # Decode the best path from the beginning to get the token IDs.
+        pos = 0
+        token_ids = np.zeros(path[0, PATH_NUM_TOKENS], dtype=np.int32)
+        token_pos = 0
+        while pos < len(data):
+            if path[pos, PATH_TOKEN_ID] >= 0:
+                token_ids[token_pos] = path[pos, PATH_TOKEN_ID]
+                if token_ids[token_pos] != BOUNDARY_TOKEN_ID:
+                    token_pos += 1
+            else:
+                # Fall back to byte tokens.
+                c = data[pos]
+                s = 1 + (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
+                # Add byte tokens representing UTF-8 bytes.
+                for i in range(s):
+                    b = c if s == 1 else (0xF00 >> s) & 0xFF if i == 0 else 0x80
+                    token_ids[token_pos] = bytes[b | ((c >> (s - i - 1) * 6) & 0x3F)]
+                    token_pos += 1
+            # Ensure that pos should increase by at least 1.
+            assert path[pos, PATH_TOKEN_LENGTH] > 0, (pos, path[pos])
+            pos += path[pos, PATH_TOKEN_LENGTH]
+        return token_ids[:token_pos]
+    def encode(self, data: str) -> np.ndarray:
+        """Encodes a string into a sequence of token IDs."""
+        if self._sp_token_matcher is not None:
+            data = self._sp_token_matcher.sub(BOUNDARY_CHAR + "\\1" + BOUNDARY_CHAR, data)
+        if self._matcher is not None:
+            data = self._matcher.sub(BOUNDARY_CHAR + "\\1" + BOUNDARY_CHAR, data)
+        return np.asarray(
+            self._encode(
+                self._to_suffix_id,
+                self._table,
+                self._bytes,
+                # Convert a string into a numpy array of Unicode code points.
+                # NOTE: This skips UTF-32 BOM.
+                np.frombuffer(data.encode("utf-32"), dtype=np.int32)[1:],
+            )
+        )
+    def encode_as_tokens(self, data: str) -> list[str]:
+        """Encodes a string into a sequence of tokens."""
+        return [self._tokens[token_id] for token_id in self.encode(data)]
+class Plamo3Tokenizer(PreTrainedTokenizer):
+    vocab_files_names = VOCAB_FILES_NAMES
+    model_input_names = ["input_ids", "attention_mask"]
+    _save_files = [
+        "special_tokens_map.json",
+        "tokenization_plamo.py",
+        "tokenizer.jsonl",
+        "tokenizer_config.json",
+    ]
+    def __init__(
+        self,
+        vocab_file: str,
+        unk_token: str = "<|plamo:unk|>",
+        bos_token: str = "<|plamo:bos|>",
+        eos_token: str = "<|plamo:eos|>",
+        pad_token: str = "<|plamo:pad|>",
+        cls_token: str | None = None,
+        sep_token: str | None = None,
+        mask_token: str | None = None,
+        clean_up_tokenization_spaces: bool = False,
+        break_around_consecutive_spaces_threshold: int | None = None,
+        break_around_repeated_chars_threshold: int | None = None,
+        **kwargs: Any,
+    ) -> None:
+        """Tokenizer for PLaMo.
+        Args:
+            vocab_file (str): Vocabrary file path.
+            unk_token (str): Unknown token.
+            bos_token (str): Beginning of sentence token.
+            eos_token (str): End of sentence token.
+            pad_token (str): Padding token.
+            cls_token (str):
+                Classification token, to extract a summary of an input sequence leveraging self-attention along the
+                full depth of the model.
+            sep_token (str): Separation token, to separate context and query in an input sequence.
+            mask_token (str): Mask token, to use when training a model with masked-language modeling.
+            clean_up_tokenization_spaces (bool): Whether or not to clean up the tokenization spaces.
+            break_around_consecutive_spaces_threshold (int, optional): Minimum number of consecutive spaces to trigger
+                boundary splits. If None, consecutive spaces won't trigger splits.
+            break_around_repeated_chars_threshold (int, optional): Minimum number of repeated characters to trigger
+                boundary splits. If None, repeated characters won't trigger splits.
+            num_threads (int):
+                Number of threads. This value will be ignored if one of `PLAMO_TOKENIZER_NUM_THREADS` or
+                `RAYON_NUM_THREADS` is set as an environment variable.
+        """
+        if "add_bos_token" not in kwargs:
+            kwargs["add_bos_token"] = False
+        if "add_eos_token" not in kwargs:
+            kwargs["add_eos_token"] = False
+        with open(vocab_file, encoding="utf-8") as f:
+            self.data: list[Any] = [json.loads(line) for line in f]
+        self.vocab: dict[str, int] = {v[0]: i for i, v in enumerate(self.data)}
+        self.aho_corasick = AhoCorasick()
+        self.break_around_consecutive_spaces_threshold = break_around_consecutive_spaces_threshold
+        self.break_around_repeated_chars_threshold = break_around_repeated_chars_threshold
+        self.aho_corasick.build(
+            self.data,
+            break_around_consecutive_spaces_threshold=self.break_around_consecutive_spaces_threshold,
+            break_around_repeated_chars_threshold=self.break_around_repeated_chars_threshold,
+        )
+        self.vocab_file = vocab_file
+        self.add_bos_token = kwargs["add_bos_token"]
+        self.add_eos_token = kwargs["add_eos_token"]
+        super().__init__(  # type: ignore[no-untyped-call]
+            vocab_file=vocab_file,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            sep_token=sep_token,
+            mask_token=mask_token,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            break_around_consecutive_spaces_threshold=break_around_consecutive_spaces_threshold,
+            break_around_repeated_chars_threshold=break_around_repeated_chars_threshold,
+            **kwargs,
+        )
+    # the functions below are copied from hf transformers LlamaTokenizer's implementation to fix the behaviour of the tokenizer
+    # https://github.com/huggingface/transformers/blob/v4.30.2/src/transformers/models/llama/tokenization_llama.py
+    def __getstate__(self) -> dict[str, Any]:
+        state = self.__dict__.copy()
+        state["aho_corasick"] = None
+        return state
+    def __setstate__(self, d: dict[str, Any]) -> None:
+        self.__dict__ = d
+        self.aho_corasick = AhoCorasick()
+        self.aho_corasick.build(
+            self.data,
+            break_around_consecutive_spaces_threshold=self.break_around_consecutive_spaces_threshold,
+            break_around_repeated_chars_threshold=self.break_around_repeated_chars_threshold,
+        )
+    @property
+    def vocab_size(self) -> Any:
+        """Returns vocab size"""
+        return len(self.data)
+    def token_to_score(self, token: str) -> float | None:
+        """Returns score of the token"""
+        token_id = self.vocab.get(token, None)
+        return None if token_id is None else self.data[token_id][1]
+    def get_vocab(self) -> dict[str, int]:
+        """Returns vocab as a dict"""
+        vocab = self.vocab.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+    def convert_tokens_to_string(self, tokens: list[str]) -> str:
+        """Converts a sequence of tokens (string) in a single string."""
+        return b"".join(
+            [bytes([int(t[3:5], 16)]) if t.startswith("<0x") else t.encode("utf-8") for t in tokens]
+        ).decode("utf-8", errors="replace")
+    def _tokenize(self, text: str, **kwargs: Any) -> list[str]:
+        """Returns a tokenized string."""
+        return self.aho_corasick.encode_as_tokens(text)
+    def _convert_token_to_id(self, token: str) -> int:
+        """Converts a token (str) in an id using the vocab."""
+        return self.vocab.get(token, 0)
+    def _convert_id_to_token(self, index: int) -> str:
+        """Converts an index (integer) in a token (str) using the vocab."""
+        return self.data[index][0]  # type: ignore[no-any-return]
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: list[int], token_ids_1: list[int] | None = None
+    ) -> list[int]:
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+        output = bos_token_id + token_ids_0 + eos_token_id
+        if token_ids_1 is not None:
+            output = output + bos_token_id + token_ids_1 + eos_token_id
+        return output
+    def save_vocabulary(self, save_directory: str, filename_prefix: str | None = None) -> tuple[str]:
+        """
+        Save the vocabulary and special tokens file to a directory.
+        Args:
+            save_directory (`str`):
+                The directory in which to save the vocabulary.
+        Returns:
+            `Tuple(str)`: Paths to the files saved.
+        """
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return ("",)
+        out_vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+            copyfile(self.vocab_file, out_vocab_file)
+        elif not os.path.isfile(self.vocab_file):
+            with open(out_vocab_file, "w") as f:
+                for token in self.data:
+                    print(json.dumps(token, ensure_ascii=False), file=f)
+        return (out_vocab_file,)

tokenizer.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,60 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|plamo:unk|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|plamo:bos|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|plamo:eos|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<|plamo:pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_plamo.Plamo3Tokenizer",
+      null
+    ]
+  },
+  "bos_token": "<|plamo:bos|>",
+  "break_around_consecutive_spaces_threshold": 2,
+  "break_around_repeated_chars_threshold": 4,
+  "clean_up_tokenization_spaces": false,
+  "cls_token": null,
+  "eos_token": "<|plamo:eos|>",
+  "extra_special_tokens": {},
+  "local_file_only": true,
+  "mask_token": null,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<|plamo:pad|>",
+  "padding_side": "right",
+  "sep_token": null,
+  "split_special_tokens": false,
+  "tokenizer_class": "Plamo3Tokenizer",
+  "unk_token": "<|plamo:unk|>"
+}

train_results.json ADDED Viewed

	@@ -0,0 +1,8 @@

+{
+    "epoch": 2.0,
+    "total_flos": 93553682546688.0,
+    "train_loss": 0.9335812793004201,
+    "train_runtime": 7476.0796,
+    "train_samples_per_second": 17.685,
+    "train_steps_per_second": 0.276
+}

trainer_log.jsonl ADDED Viewed

	@@ -0,0 +1,228 @@

+{"current_steps": 10, "total_steps": 2066, "loss": 1.7236, "lr": 7.258064516129033e-07, "epoch": 0.00968054211035818, "percentage": 0.48, "elapsed_time": "0:00:36", "remaining_time": "2:04:00"}
+{"current_steps": 20, "total_steps": 2066, "loss": 1.627, "lr": 1.5322580645161292e-06, "epoch": 0.01936108422071636, "percentage": 0.97, "elapsed_time": "0:01:11", "remaining_time": "2:01:15"}
+{"current_steps": 30, "total_steps": 2066, "loss": 1.4318, "lr": 2.338709677419355e-06, "epoch": 0.02904162633107454, "percentage": 1.45, "elapsed_time": "0:01:46", "remaining_time": "2:00:55"}
+{"current_steps": 40, "total_steps": 2066, "loss": 1.3885, "lr": 3.145161290322581e-06, "epoch": 0.03872216844143272, "percentage": 1.94, "elapsed_time": "0:02:20", "remaining_time": "1:59:01"}
+{"current_steps": 50, "total_steps": 2066, "loss": 1.3955, "lr": 3.951612903225807e-06, "epoch": 0.0484027105517909, "percentage": 2.42, "elapsed_time": "0:02:59", "remaining_time": "2:00:19"}
+{"current_steps": 60, "total_steps": 2066, "loss": 1.2718, "lr": 4.758064516129033e-06, "epoch": 0.05808325266214908, "percentage": 2.9, "elapsed_time": "0:03:32", "remaining_time": "1:58:09"}
+{"current_steps": 70, "total_steps": 2066, "loss": 1.3654, "lr": 4.999849475897687e-06, "epoch": 0.06776379477250725, "percentage": 3.39, "elapsed_time": "0:04:11", "remaining_time": "1:59:35"}
+{"current_steps": 80, "total_steps": 2066, "loss": 1.2831, "lr": 4.999112258623345e-06, "epoch": 0.07744433688286544, "percentage": 3.87, "elapsed_time": "0:04:45", "remaining_time": "1:58:13"}
+{"current_steps": 90, "total_steps": 2066, "loss": 1.3002, "lr": 4.997760881838323e-06, "epoch": 0.08712487899322362, "percentage": 4.36, "elapsed_time": "0:05:17", "remaining_time": "1:56:21"}
+{"current_steps": 100, "total_steps": 2066, "loss": 1.287, "lr": 4.995795677644913e-06, "epoch": 0.0968054211035818, "percentage": 4.84, "elapsed_time": "0:05:51", "remaining_time": "1:55:03"}
+{"current_steps": 110, "total_steps": 2066, "loss": 1.2492, "lr": 4.993217128994149e-06, "epoch": 0.10648596321393998, "percentage": 5.32, "elapsed_time": "0:07:17", "remaining_time": "2:09:35"}
+{"current_steps": 120, "total_steps": 2066, "loss": 1.2794, "lr": 4.9900258695671176e-06, "epoch": 0.11616650532429816, "percentage": 5.81, "elapsed_time": "0:07:53", "remaining_time": "2:08:05"}
+{"current_steps": 130, "total_steps": 2066, "loss": 1.2506, "lr": 4.986222683619237e-06, "epoch": 0.12584704743465633, "percentage": 6.29, "elapsed_time": "0:08:28", "remaining_time": "2:06:16"}
+{"current_steps": 140, "total_steps": 2066, "loss": 1.2609, "lr": 4.981808505787523e-06, "epoch": 0.1355275895450145, "percentage": 6.78, "elapsed_time": "0:09:01", "remaining_time": "2:04:14"}
+{"current_steps": 150, "total_steps": 2066, "loss": 1.2329, "lr": 4.976784420860898e-06, "epoch": 0.1452081316553727, "percentage": 7.26, "elapsed_time": "0:09:35", "remaining_time": "2:02:31"}
+{"current_steps": 160, "total_steps": 2066, "loss": 1.3551, "lr": 4.971151663513608e-06, "epoch": 0.15488867376573087, "percentage": 7.74, "elapsed_time": "0:10:12", "remaining_time": "2:01:34"}
+{"current_steps": 170, "total_steps": 2066, "loss": 1.261, "lr": 4.964911618001794e-06, "epoch": 0.16456921587608905, "percentage": 8.23, "elapsed_time": "0:10:45", "remaining_time": "1:59:54"}
+{"current_steps": 180, "total_steps": 2066, "loss": 1.2055, "lr": 4.958065817823318e-06, "epoch": 0.17424975798644723, "percentage": 8.71, "elapsed_time": "0:11:18", "remaining_time": "1:58:31"}
+{"current_steps": 190, "total_steps": 2066, "loss": 1.3022, "lr": 4.950615945340893e-06, "epoch": 0.18393030009680542, "percentage": 9.2, "elapsed_time": "0:11:52", "remaining_time": "1:57:15"}
+{"current_steps": 200, "total_steps": 2066, "loss": 1.2701, "lr": 4.942563831368653e-06, "epoch": 0.1936108422071636, "percentage": 9.68, "elapsed_time": "0:12:26", "remaining_time": "1:56:02"}
+{"current_steps": 200, "total_steps": 2066, "eval_loss": 1.3192224502563477, "epoch": 0.1936108422071636, "percentage": 9.68, "elapsed_time": "0:12:37", "remaining_time": "1:57:46"}
+{"current_steps": 210, "total_steps": 2066, "loss": 1.277, "lr": 4.933911454722217e-06, "epoch": 0.20329138431752178, "percentage": 10.16, "elapsed_time": "0:14:39", "remaining_time": "2:09:33"}
+{"current_steps": 220, "total_steps": 2066, "loss": 1.2418, "lr": 4.924660941732403e-06, "epoch": 0.21297192642787996, "percentage": 10.65, "elapsed_time": "0:15:16", "remaining_time": "2:08:09"}
+{"current_steps": 230, "total_steps": 2066, "loss": 1.294, "lr": 4.914814565722671e-06, "epoch": 0.22265246853823814, "percentage": 11.13, "elapsed_time": "0:15:48", "remaining_time": "2:06:13"}
+{"current_steps": 240, "total_steps": 2066, "loss": 1.2823, "lr": 4.9043747464504586e-06, "epoch": 0.23233301064859632, "percentage": 11.62, "elapsed_time": "0:16:28", "remaining_time": "2:05:24"}
+{"current_steps": 250, "total_steps": 2066, "loss": 1.2753, "lr": 4.893344049512519e-06, "epoch": 0.2420135527589545, "percentage": 12.1, "elapsed_time": "0:17:04", "remaining_time": "2:04:02"}
+{"current_steps": 260, "total_steps": 2066, "loss": 1.1851, "lr": 4.881725185714421e-06, "epoch": 0.25169409486931266, "percentage": 12.58, "elapsed_time": "0:17:37", "remaining_time": "2:02:22"}
+{"current_steps": 270, "total_steps": 2066, "loss": 1.2901, "lr": 4.869521010404373e-06, "epoch": 0.26137463697967084, "percentage": 13.07, "elapsed_time": "0:18:13", "remaining_time": "2:01:16"}
+{"current_steps": 280, "total_steps": 2066, "loss": 1.246, "lr": 4.856734522771512e-06, "epoch": 0.271055179090029, "percentage": 13.55, "elapsed_time": "0:18:56", "remaining_time": "2:00:47"}
+{"current_steps": 290, "total_steps": 2066, "loss": 1.204, "lr": 4.843368865108847e-06, "epoch": 0.2807357212003872, "percentage": 14.04, "elapsed_time": "0:19:30", "remaining_time": "1:59:30"}
+{"current_steps": 300, "total_steps": 2066, "loss": 1.271, "lr": 4.8294273220410494e-06, "epoch": 0.2904162633107454, "percentage": 14.52, "elapsed_time": "0:20:06", "remaining_time": "1:58:23"}
+{"current_steps": 310, "total_steps": 2066, "loss": 1.307, "lr": 4.814913319717238e-06, "epoch": 0.30009680542110356, "percentage": 15.0, "elapsed_time": "0:28:47", "remaining_time": "2:43:06"}
+{"current_steps": 320, "total_steps": 2066, "loss": 1.273, "lr": 4.799830424969008e-06, "epoch": 0.30977734753146174, "percentage": 15.49, "elapsed_time": "0:29:21", "remaining_time": "2:40:13"}
+{"current_steps": 330, "total_steps": 2066, "loss": 1.2719, "lr": 4.784182344433878e-06, "epoch": 0.3194578896418199, "percentage": 15.97, "elapsed_time": "0:29:56", "remaining_time": "2:37:32"}
+{"current_steps": 340, "total_steps": 2066, "loss": 1.2732, "lr": 4.767972923644377e-06, "epoch": 0.3291384317521781, "percentage": 16.46, "elapsed_time": "0:30:32", "remaining_time": "2:35:01"}
+{"current_steps": 350, "total_steps": 2066, "loss": 1.3289, "lr": 4.751206146083002e-06, "epoch": 0.3388189738625363, "percentage": 16.94, "elapsed_time": "0:31:06", "remaining_time": "2:32:29"}
+{"current_steps": 360, "total_steps": 2066, "loss": 1.2303, "lr": 4.7338861322032724e-06, "epoch": 0.34849951597289447, "percentage": 17.42, "elapsed_time": "0:31:38", "remaining_time": "2:29:56"}
+{"current_steps": 370, "total_steps": 2066, "loss": 1.1788, "lr": 4.716017138417126e-06, "epoch": 0.35818005808325265, "percentage": 17.91, "elapsed_time": "0:32:12", "remaining_time": "2:27:37"}
+{"current_steps": 380, "total_steps": 2066, "loss": 1.2543, "lr": 4.697603556048899e-06, "epoch": 0.36786060019361083, "percentage": 18.39, "elapsed_time": "0:32:44", "remaining_time": "2:25:18"}
+{"current_steps": 390, "total_steps": 2066, "loss": 1.3091, "lr": 4.6786499102561525e-06, "epoch": 0.377541142303969, "percentage": 18.88, "elapsed_time": "0:33:18", "remaining_time": "2:23:07"}
+{"current_steps": 400, "total_steps": 2066, "loss": 1.2693, "lr": 4.659160858917614e-06, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:34:01", "remaining_time": "2:21:44"}
+{"current_steps": 400, "total_steps": 2066, "eval_loss": 1.3101810216903687, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:34:13", "remaining_time": "2:22:34"}
+{"current_steps": 310, "total_steps": 2066, "loss": 1.307, "lr": 4.814913319717238e-06, "epoch": 0.30009680542110356, "percentage": 15.0, "elapsed_time": "0:00:38", "remaining_time": "0:03:35"}
+{"current_steps": 320, "total_steps": 2066, "loss": 1.273, "lr": 4.799830424969008e-06, "epoch": 0.30977734753146174, "percentage": 15.49, "elapsed_time": "0:01:11", "remaining_time": "0:06:29"}
+{"current_steps": 330, "total_steps": 2066, "loss": 1.2719, "lr": 4.784182344433878e-06, "epoch": 0.3194578896418199, "percentage": 15.97, "elapsed_time": "0:01:45", "remaining_time": "0:09:16"}
+{"current_steps": 340, "total_steps": 2066, "loss": 1.2732, "lr": 4.767972923644377e-06, "epoch": 0.3291384317521781, "percentage": 16.46, "elapsed_time": "0:02:20", "remaining_time": "0:11:52"}
+{"current_steps": 350, "total_steps": 2066, "loss": 1.3289, "lr": 4.751206146083002e-06, "epoch": 0.3388189738625363, "percentage": 16.94, "elapsed_time": "0:02:54", "remaining_time": "0:14:14"}
+{"current_steps": 360, "total_steps": 2066, "loss": 1.2303, "lr": 4.7338861322032724e-06, "epoch": 0.34849951597289447, "percentage": 17.42, "elapsed_time": "0:03:26", "remaining_time": "0:16:18"}
+{"current_steps": 370, "total_steps": 2066, "loss": 1.1788, "lr": 4.716017138417126e-06, "epoch": 0.35818005808325265, "percentage": 17.91, "elapsed_time": "0:04:00", "remaining_time": "0:18:22"}
+{"current_steps": 380, "total_steps": 2066, "loss": 1.2543, "lr": 4.697603556048899e-06, "epoch": 0.36786060019361083, "percentage": 18.39, "elapsed_time": "0:04:32", "remaining_time": "0:20:10"}
+{"current_steps": 390, "total_steps": 2066, "loss": 1.3091, "lr": 4.6786499102561525e-06, "epoch": 0.377541142303969, "percentage": 18.88, "elapsed_time": "0:05:05", "remaining_time": "0:21:52"}
+{"current_steps": 400, "total_steps": 2066, "loss": 1.2693, "lr": 4.659160858917614e-06, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:05:48", "remaining_time": "0:24:10"}
+{"current_steps": 400, "total_steps": 2066, "eval_loss": 1.3101810216903687, "epoch": 0.3872216844143272, "percentage": 19.36, "elapsed_time": "0:05:59", "remaining_time": "0:24:56"}
+{"current_steps": 410, "total_steps": 2066, "loss": 1.2866, "lr": 4.639141191488498e-06, "epoch": 0.3969022265246854, "percentage": 19.85, "elapsed_time": "0:07:46", "remaining_time": "0:31:25"}
+{"current_steps": 420, "total_steps": 2066, "loss": 1.3088, "lr": 4.618595827823486e-06, "epoch": 0.40658276863504356, "percentage": 20.33, "elapsed_time": "0:08:32", "remaining_time": "0:33:27"}
+{"current_steps": 430, "total_steps": 2066, "loss": 1.2445, "lr": 4.597529816967676e-06, "epoch": 0.41626331074540174, "percentage": 20.81, "elapsed_time": "0:09:04", "remaining_time": "0:34:31"}
+{"current_steps": 440, "total_steps": 2066, "loss": 1.2679, "lr": 4.575948335915769e-06, "epoch": 0.4259438528557599, "percentage": 21.3, "elapsed_time": "0:09:43", "remaining_time": "0:35:58"}
+{"current_steps": 450, "total_steps": 2066, "loss": 1.2699, "lr": 4.553856688339817e-06, "epoch": 0.4356243949661181, "percentage": 21.78, "elapsed_time": "0:10:24", "remaining_time": "0:37:22"}
+{"current_steps": 460, "total_steps": 2066, "loss": 1.2381, "lr": 4.531260303285841e-06, "epoch": 0.4453049370764763, "percentage": 22.27, "elapsed_time": "0:10:59", "remaining_time": "0:38:23"}
+{"current_steps": 470, "total_steps": 2066, "loss": 1.3089, "lr": 4.50816473383964e-06, "epoch": 0.45498547918683446, "percentage": 22.75, "elapsed_time": "0:11:49", "remaining_time": "0:40:10"}
+{"current_steps": 480, "total_steps": 2066, "loss": 1.2271, "lr": 4.484575655762107e-06, "epoch": 0.46466602129719264, "percentage": 23.23, "elapsed_time": "0:12:21", "remaining_time": "0:40:50"}
+{"current_steps": 490, "total_steps": 2066, "loss": 1.2136, "lr": 4.460498866094412e-06, "epoch": 0.4743465634075508, "percentage": 23.72, "elapsed_time": "0:12:57", "remaining_time": "0:41:39"}
+{"current_steps": 500, "total_steps": 2066, "loss": 1.2747, "lr": 4.435940281733369e-06, "epoch": 0.484027105517909, "percentage": 24.2, "elapsed_time": "0:13:30", "remaining_time": "0:42:17"}
+{"current_steps": 510, "total_steps": 2066, "loss": 1.265, "lr": 4.410905937977353e-06, "epoch": 0.4937076476282672, "percentage": 24.69, "elapsed_time": "0:15:07", "remaining_time": "0:46:10"}
+{"current_steps": 520, "total_steps": 2066, "loss": 1.2895, "lr": 4.385401987043118e-06, "epoch": 0.5033881897386253, "percentage": 25.17, "elapsed_time": "0:15:39", "remaining_time": "0:46:34"}
+{"current_steps": 530, "total_steps": 2066, "loss": 1.2376, "lr": 4.359434696553889e-06, "epoch": 0.5130687318489835, "percentage": 25.65, "elapsed_time": "0:16:14", "remaining_time": "0:47:03"}
+{"current_steps": 540, "total_steps": 2066, "loss": 1.2575, "lr": 4.333010447999077e-06, "epoch": 0.5227492739593417, "percentage": 26.14, "elapsed_time": "0:16:48", "remaining_time": "0:47:28"}
+{"current_steps": 550, "total_steps": 2066, "loss": 1.267, "lr": 4.3061357351660285e-06, "epoch": 0.5324298160696999, "percentage": 26.62, "elapsed_time": "0:17:28", "remaining_time": "0:48:10"}
+{"current_steps": 560, "total_steps": 2066, "loss": 1.2584, "lr": 4.27881716254417e-06, "epoch": 0.542110358180058, "percentage": 27.11, "elapsed_time": "0:18:04", "remaining_time": "0:48:36"}
+{"current_steps": 570, "total_steps": 2066, "loss": 1.2263, "lr": 4.251061443701941e-06, "epoch": 0.5517909002904162, "percentage": 27.59, "elapsed_time": "0:18:42", "remaining_time": "0:49:07"}
+{"current_steps": 580, "total_steps": 2066, "loss": 1.2231, "lr": 4.222875399636938e-06, "epoch": 0.5614714424007744, "percentage": 28.07, "elapsed_time": "0:19:14", "remaining_time": "0:49:18"}
+{"current_steps": 590, "total_steps": 2066, "loss": 1.2656, "lr": 4.194265957099638e-06, "epoch": 0.5711519845111326, "percentage": 28.56, "elapsed_time": "0:19:47", "remaining_time": "0:49:29"}
+{"current_steps": 600, "total_steps": 2066, "loss": 1.2341, "lr": 4.165240146891145e-06, "epoch": 0.5808325266214908, "percentage": 29.04, "elapsed_time": "0:20:20", "remaining_time": "0:49:41"}
+{"current_steps": 600, "total_steps": 2066, "eval_loss": 1.3036646842956543, "epoch": 0.5808325266214908, "percentage": 29.04, "elapsed_time": "0:20:31", "remaining_time": "0:50:08"}
+{"current_steps": 610, "total_steps": 2066, "loss": 1.2413, "lr": 4.1358051021353655e-06, "epoch": 0.590513068731849, "percentage": 29.53, "elapsed_time": "0:22:10", "remaining_time": "0:52:55"}
+{"current_steps": 620, "total_steps": 2066, "loss": 1.2342, "lr": 4.1059680565260315e-06, "epoch": 0.6001936108422071, "percentage": 30.01, "elapsed_time": "0:22:53", "remaining_time": "0:53:22"}
+{"current_steps": 630, "total_steps": 2066, "loss": 1.1899, "lr": 4.0757363425490185e-06, "epoch": 0.6098741529525653, "percentage": 30.49, "elapsed_time": "0:23:28", "remaining_time": "0:53:30"}
+{"current_steps": 640, "total_steps": 2066, "loss": 1.1912, "lr": 4.04511738968037e-06, "epoch": 0.6195546950629235, "percentage": 30.98, "elapsed_time": "0:24:12", "remaining_time": "0:53:57"}
+{"current_steps": 650, "total_steps": 2066, "loss": 1.2066, "lr": 4.0141187225605064e-06, "epoch": 0.6292352371732817, "percentage": 31.46, "elapsed_time": "0:24:50", "remaining_time": "0:54:07"}
+{"current_steps": 660, "total_steps": 2066, "loss": 1.2394, "lr": 3.98274795914503e-06, "epoch": 0.6389157792836399, "percentage": 31.95, "elapsed_time": "0:25:22", "remaining_time": "0:54:03"}
+{"current_steps": 670, "total_steps": 2066, "loss": 1.2069, "lr": 3.951012808832603e-06, "epoch": 0.648596321393998, "percentage": 32.43, "elapsed_time": "0:25:57", "remaining_time": "0:54:04"}
+{"current_steps": 680, "total_steps": 2066, "loss": 1.2724, "lr": 3.918921070570361e-06, "epoch": 0.6582768635043562, "percentage": 32.91, "elapsed_time": "0:26:31", "remaining_time": "0:54:03"}
+{"current_steps": 690, "total_steps": 2066, "loss": 1.3105, "lr": 3.886480630937307e-06, "epoch": 0.6679574056147144, "percentage": 33.4, "elapsed_time": "0:27:02", "remaining_time": "0:53:55"}
+{"current_steps": 700, "total_steps": 2066, "loss": 1.1989, "lr": 3.853699462206183e-06, "epoch": 0.6776379477250726, "percentage": 33.88, "elapsed_time": "0:27:35", "remaining_time": "0:53:50"}
+{"current_steps": 710, "total_steps": 2066, "loss": 1.3256, "lr": 3.820585620384265e-06, "epoch": 0.6873184898354308, "percentage": 34.37, "elapsed_time": "0:29:14", "remaining_time": "0:55:50"}
+{"current_steps": 720, "total_steps": 2066, "loss": 1.2206, "lr": 3.787147243233602e-06, "epoch": 0.6969990319457889, "percentage": 34.85, "elapsed_time": "0:29:46", "remaining_time": "0:55:39"}
+{"current_steps": 730, "total_steps": 2066, "loss": 1.2245, "lr": 3.753392548271144e-06, "epoch": 0.7066795740561471, "percentage": 35.33, "elapsed_time": "0:30:18", "remaining_time": "0:55:28"}
+{"current_steps": 740, "total_steps": 2066, "loss": 1.2685, "lr": 3.7193298307492855e-06, "epoch": 0.7163601161665053, "percentage": 35.82, "elapsed_time": "0:30:56", "remaining_time": "0:55:26"}
+{"current_steps": 750, "total_steps": 2066, "loss": 1.2379, "lr": 3.6849674616172887e-06, "epoch": 0.7260406582768635, "percentage": 36.3, "elapsed_time": "0:31:28", "remaining_time": "0:55:12"}
+{"current_steps": 760, "total_steps": 2066, "loss": 1.2176, "lr": 3.6503138854641257e-06, "epoch": 0.7357212003872217, "percentage": 36.79, "elapsed_time": "0:32:00", "remaining_time": "0:54:59"}
+{"current_steps": 770, "total_steps": 2066, "loss": 1.2751, "lr": 3.615377618443201e-06, "epoch": 0.7454017424975798, "percentage": 37.27, "elapsed_time": "0:32:33", "remaining_time": "0:54:47"}
+{"current_steps": 780, "total_steps": 2066, "loss": 1.2335, "lr": 3.5801672461795032e-06, "epoch": 0.755082284607938, "percentage": 37.75, "elapsed_time": "0:33:05", "remaining_time": "0:54:32"}
+{"current_steps": 790, "total_steps": 2066, "loss": 1.2816, "lr": 3.5446914216596805e-06, "epoch": 0.7647628267182962, "percentage": 38.24, "elapsed_time": "0:33:36", "remaining_time": "0:54:17"}
+{"current_steps": 800, "total_steps": 2066, "loss": 1.1997, "lr": 3.5089588631055527e-06, "epoch": 0.7744433688286544, "percentage": 38.72, "elapsed_time": "0:34:08", "remaining_time": "0:54:01"}
+{"current_steps": 800, "total_steps": 2066, "eval_loss": 1.2973461151123047, "epoch": 0.7744433688286544, "percentage": 38.72, "elapsed_time": "0:34:19", "remaining_time": "0:54:19"}
+{"current_steps": 810, "total_steps": 2066, "loss": 1.2153, "lr": 3.472978351831606e-06, "epoch": 0.7841239109390126, "percentage": 39.21, "elapsed_time": "0:35:55", "remaining_time": "0:55:43"}
+{"current_steps": 820, "total_steps": 2066, "loss": 1.1981, "lr": 3.436758730086971e-06, "epoch": 0.7938044530493708, "percentage": 39.69, "elapsed_time": "0:36:31", "remaining_time": "0:55:30"}
+{"current_steps": 830, "total_steps": 2066, "loss": 1.2271, "lr": 3.4003088988824323e-06, "epoch": 0.8034849951597289, "percentage": 40.17, "elapsed_time": "0:37:02", "remaining_time": "0:55:09"}
+{"current_steps": 840, "total_steps": 2066, "loss": 1.2394, "lr": 3.363637815802998e-06, "epoch": 0.8131655372700871, "percentage": 40.66, "elapsed_time": "0:37:34", "remaining_time": "0:54:50"}
+{"current_steps": 850, "total_steps": 2066, "loss": 1.2334, "lr": 3.326754492806559e-06, "epoch": 0.8228460793804453, "percentage": 41.14, "elapsed_time": "0:38:06", "remaining_time": "0:54:31"}
+{"current_steps": 860, "total_steps": 2066, "loss": 1.2327, "lr": 3.2896679940091913e-06, "epoch": 0.8325266214908035, "percentage": 41.63, "elapsed_time": "0:38:39", "remaining_time": "0:54:13"}
+{"current_steps": 870, "total_steps": 2066, "loss": 1.2282, "lr": 3.2523874334576456e-06, "epoch": 0.8422071636011617, "percentage": 42.11, "elapsed_time": "0:39:11", "remaining_time": "0:53:52"}
+{"current_steps": 880, "total_steps": 2066, "loss": 1.2132, "lr": 3.214921972889552e-06, "epoch": 0.8518877057115198, "percentage": 42.59, "elapsed_time": "0:39:42", "remaining_time": "0:53:30"}
+{"current_steps": 890, "total_steps": 2066, "loss": 1.2473, "lr": 3.17728081948192e-06, "epoch": 0.861568247821878, "percentage": 43.08, "elapsed_time": "0:40:17", "remaining_time": "0:53:14"}
+{"current_steps": 900, "total_steps": 2066, "loss": 1.2524, "lr": 3.139473223588462e-06, "epoch": 0.8712487899322362, "percentage": 43.56, "elapsed_time": "0:40:56", "remaining_time": "0:53:01"}
+{"current_steps": 910, "total_steps": 2066, "loss": 1.2423, "lr": 3.1015084764663074e-06, "epoch": 0.8809293320425944, "percentage": 44.05, "elapsed_time": "0:42:33", "remaining_time": "0:54:03"}
+{"current_steps": 920, "total_steps": 2066, "loss": 1.1997, "lr": 3.063395907992671e-06, "epoch": 0.8906098741529526, "percentage": 44.53, "elapsed_time": "0:43:20", "remaining_time": "0:53:59"}
+{"current_steps": 930, "total_steps": 2066, "loss": 1.2233, "lr": 3.025144884372021e-06, "epoch": 0.9002904162633107, "percentage": 45.01, "elapsed_time": "0:43:58", "remaining_time": "0:53:43"}
+{"current_steps": 940, "total_steps": 2066, "loss": 1.2115, "lr": 2.9867648058343262e-06, "epoch": 0.9099709583736689, "percentage": 45.5, "elapsed_time": "0:44:33", "remaining_time": "0:53:22"}
+{"current_steps": 950, "total_steps": 2066, "loss": 1.2139, "lr": 2.948265104324941e-06, "epoch": 0.9196515004840271, "percentage": 45.98, "elapsed_time": "0:45:08", "remaining_time": "0:53:01"}
+{"current_steps": 960, "total_steps": 2066, "loss": 1.2201, "lr": 2.9096552411866903e-06, "epoch": 0.9293320425943853, "percentage": 46.47, "elapsed_time": "0:45:41", "remaining_time": "0:52:38"}
+{"current_steps": 970, "total_steps": 2066, "loss": 1.1997, "lr": 2.8709447048347394e-06, "epoch": 0.9390125847047435, "percentage": 46.95, "elapsed_time": "0:46:17", "remaining_time": "0:52:18"}
+{"current_steps": 980, "total_steps": 2066, "loss": 1.2363, "lr": 2.832143008424802e-06, "epoch": 0.9486931268151017, "percentage": 47.43, "elapsed_time": "0:46:52", "remaining_time": "0:51:56"}
+{"current_steps": 990, "total_steps": 2066, "loss": 1.2573, "lr": 2.7932596875152747e-06, "epoch": 0.9583736689254598, "percentage": 47.92, "elapsed_time": "0:47:47", "remaining_time": "0:51:56"}
+{"current_steps": 1000, "total_steps": 2066, "loss": 1.2403, "lr": 2.754304297723862e-06, "epoch": 0.968054211035818, "percentage": 48.4, "elapsed_time": "0:48:19", "remaining_time": "0:51:30"}
+{"current_steps": 1000, "total_steps": 2066, "eval_loss": 1.2926961183547974, "epoch": 0.968054211035818, "percentage": 48.4, "elapsed_time": "0:48:31", "remaining_time": "0:51:43"}
+{"current_steps": 1010, "total_steps": 2066, "loss": 1.2915, "lr": 2.7152864123792716e-06, "epoch": 0.9777347531461762, "percentage": 48.89, "elapsed_time": "0:50:11", "remaining_time": "0:52:28"}
+{"current_steps": 1020, "total_steps": 2066, "loss": 1.2246, "lr": 2.6762156201685627e-06, "epoch": 0.9874152952565344, "percentage": 49.37, "elapsed_time": "0:50:42", "remaining_time": "0:52:00"}
+{"current_steps": 1030, "total_steps": 2066, "loss": 1.302, "lr": 2.6371015227807127e-06, "epoch": 0.9970958373668926, "percentage": 49.85, "elapsed_time": "0:51:16", "remaining_time": "0:51:34"}
+{"current_steps": 1040, "total_steps": 2066, "loss": 1.1438, "lr": 2.5979537325469913e-06, "epoch": 1.0067763794772506, "percentage": 50.34, "elapsed_time": "0:51:49", "remaining_time": "0:51:07"}
+{"current_steps": 1050, "total_steps": 2066, "loss": 0.9893, "lr": 2.558781870078722e-06, "epoch": 1.016456921587609, "percentage": 50.82, "elapsed_time": "0:52:21", "remaining_time": "0:50:39"}
+{"current_steps": 1060, "total_steps": 2066, "loss": 0.9725, "lr": 2.5195955619030064e-06, "epoch": 1.026137463697967, "percentage": 51.31, "elapsed_time": "0:52:57", "remaining_time": "0:50:15"}
+{"current_steps": 1070, "total_steps": 2066, "loss": 0.9776, "lr": 2.480404438096994e-06, "epoch": 1.0358180058083253, "percentage": 51.79, "elapsed_time": "0:53:30", "remaining_time": "0:49:48"}
+{"current_steps": 1080, "total_steps": 2066, "loss": 1.0161, "lr": 2.441218129921278e-06, "epoch": 1.0454985479186834, "percentage": 52.27, "elapsed_time": "0:54:08", "remaining_time": "0:49:26"}
+{"current_steps": 1090, "total_steps": 2066, "loss": 1.0164, "lr": 2.402046267453009e-06, "epoch": 1.0551790900290416, "percentage": 52.76, "elapsed_time": "0:54:46", "remaining_time": "0:49:03"}
+{"current_steps": 1100, "total_steps": 2066, "loss": 0.9799, "lr": 2.3628984772192885e-06, "epoch": 1.0648596321393997, "percentage": 53.24, "elapsed_time": "0:55:18", "remaining_time": "0:48:33"}
+{"current_steps": 1110, "total_steps": 2066, "loss": 0.9829, "lr": 2.323784379831438e-06, "epoch": 1.074540174249758, "percentage": 53.73, "elapsed_time": "0:56:56", "remaining_time": "0:49:02"}
+{"current_steps": 1120, "total_steps": 2066, "loss": 0.9397, "lr": 2.2847135876207292e-06, "epoch": 1.084220716360116, "percentage": 54.21, "elapsed_time": "0:57:31", "remaining_time": "0:48:35"}
+{"current_steps": 1130, "total_steps": 2066, "loss": 0.9544, "lr": 2.245695702276139e-06, "epoch": 1.0939012584704744, "percentage": 54.7, "elapsed_time": "0:58:12", "remaining_time": "0:48:12"}
+{"current_steps": 1140, "total_steps": 2066, "loss": 0.9867, "lr": 2.2067403124847257e-06, "epoch": 1.1035818005808324, "percentage": 55.18, "elapsed_time": "0:58:44", "remaining_time": "0:47:42"}
+{"current_steps": 1150, "total_steps": 2066, "loss": 0.9843, "lr": 2.167856991575199e-06, "epoch": 1.1132623426911907, "percentage": 55.66, "elapsed_time": "0:59:15", "remaining_time": "0:47:12"}
+{"current_steps": 1160, "total_steps": 2066, "loss": 0.9621, "lr": 2.1290552951652614e-06, "epoch": 1.1229428848015488, "percentage": 56.15, "elapsed_time": "0:59:50", "remaining_time": "0:46:44"}
+{"current_steps": 1170, "total_steps": 2066, "loss": 1.0003, "lr": 2.09034475881331e-06, "epoch": 1.132623426911907, "percentage": 56.63, "elapsed_time": "1:00:26", "remaining_time": "0:46:17"}
+{"current_steps": 1180, "total_steps": 2066, "loss": 0.9598, "lr": 2.0517348956750597e-06, "epoch": 1.1423039690222652, "percentage": 57.12, "elapsed_time": "1:01:03", "remaining_time": "0:45:51"}
+{"current_steps": 1190, "total_steps": 2066, "loss": 0.9328, "lr": 2.0132351941656737e-06, "epoch": 1.1519845111326235, "percentage": 57.6, "elapsed_time": "1:01:38", "remaining_time": "0:45:22"}
+{"current_steps": 1200, "total_steps": 2066, "loss": 0.9994, "lr": 1.9748551156279803e-06, "epoch": 1.1616650532429815, "percentage": 58.08, "elapsed_time": "1:02:10", "remaining_time": "0:44:52"}
+{"current_steps": 1200, "total_steps": 2066, "eval_loss": 1.3311784267425537, "epoch": 1.1616650532429815, "percentage": 58.08, "elapsed_time": "1:02:21", "remaining_time": "0:45:00"}
+{"current_steps": 1210, "total_steps": 2066, "loss": 1.0021, "lr": 1.93660409200733e-06, "epoch": 1.1713455953533398, "percentage": 58.57, "elapsed_time": "1:03:58", "remaining_time": "0:45:15"}
+{"current_steps": 1220, "total_steps": 2066, "loss": 1.0096, "lr": 1.8984915235336934e-06, "epoch": 1.181026137463698, "percentage": 59.05, "elapsed_time": "1:04:31", "remaining_time": "0:44:44"}
+{"current_steps": 1230, "total_steps": 2066, "loss": 1.0096, "lr": 1.860526776411539e-06, "epoch": 1.1907066795740562, "percentage": 59.54, "elapsed_time": "1:05:11", "remaining_time": "0:44:18"}
+{"current_steps": 1240, "total_steps": 2066, "loss": 1.0768, "lr": 1.8227191805180806e-06, "epoch": 1.2003872216844143, "percentage": 60.02, "elapsed_time": "1:05:44", "remaining_time": "0:43:47"}
+{"current_steps": 1250, "total_steps": 2066, "loss": 1.0072, "lr": 1.7850780271104483e-06, "epoch": 1.2100677637947725, "percentage": 60.5, "elapsed_time": "1:06:20", "remaining_time": "0:43:18"}
+{"current_steps": 1260, "total_steps": 2066, "loss": 0.9954, "lr": 1.747612566542356e-06, "epoch": 1.2197483059051306, "percentage": 60.99, "elapsed_time": "1:06:51", "remaining_time": "0:42:45"}
+{"current_steps": 1270, "total_steps": 2066, "loss": 0.9856, "lr": 1.7103320059908093e-06, "epoch": 1.229428848015489, "percentage": 61.47, "elapsed_time": "1:07:23", "remaining_time": "0:42:14"}
+{"current_steps": 1280, "total_steps": 2066, "loss": 0.9882, "lr": 1.6732455071934424e-06, "epoch": 1.239109390125847, "percentage": 61.96, "elapsed_time": "1:07:55", "remaining_time": "0:41:42"}
+{"current_steps": 1290, "total_steps": 2066, "loss": 0.9218, "lr": 1.6363621841970022e-06, "epoch": 1.2487899322362053, "percentage": 62.44, "elapsed_time": "1:08:36", "remaining_time": "0:41:16"}
+{"current_steps": 1300, "total_steps": 2066, "loss": 0.9803, "lr": 1.5996911011175675e-06, "epoch": 1.2584704743465633, "percentage": 62.92, "elapsed_time": "1:09:08", "remaining_time": "0:40:44"}
+{"current_steps": 1310, "total_steps": 2066, "loss": 0.9732, "lr": 1.5632412699130306e-06, "epoch": 1.2681510164569216, "percentage": 63.41, "elapsed_time": "1:10:43", "remaining_time": "0:40:48"}
+{"current_steps": 1320, "total_steps": 2066, "loss": 0.9656, "lr": 1.5270216481683954e-06, "epoch": 1.2778315585672797, "percentage": 63.89, "elapsed_time": "1:11:17", "remaining_time": "0:40:17"}
+{"current_steps": 1330, "total_steps": 2066, "loss": 0.968, "lr": 1.4910411368944483e-06, "epoch": 1.287512100677638, "percentage": 64.38, "elapsed_time": "1:11:53", "remaining_time": "0:39:46"}
+{"current_steps": 1340, "total_steps": 2066, "loss": 1.0192, "lr": 1.4553085783403201e-06, "epoch": 1.297192642787996, "percentage": 64.86, "elapsed_time": "1:12:24", "remaining_time": "0:39:13"}
+{"current_steps": 1350, "total_steps": 2066, "loss": 0.9862, "lr": 1.419832753820496e-06, "epoch": 1.3068731848983544, "percentage": 65.34, "elapsed_time": "1:13:02", "remaining_time": "0:38:44"}
+{"current_steps": 1360, "total_steps": 2066, "loss": 0.9557, "lr": 1.3846223815568005e-06, "epoch": 1.3165537270087124, "percentage": 65.83, "elapsed_time": "1:13:35", "remaining_time": "0:38:12"}
+{"current_steps": 1370, "total_steps": 2066, "loss": 0.9833, "lr": 1.349686114535875e-06, "epoch": 1.3262342691190707, "percentage": 66.31, "elapsed_time": "1:14:11", "remaining_time": "0:37:41"}
+{"current_steps": 1380, "total_steps": 2066, "loss": 1.0331, "lr": 1.3150325383827117e-06, "epoch": 1.3359148112294288, "percentage": 66.8, "elapsed_time": "1:14:43", "remaining_time": "0:37:08"}
+{"current_steps": 1390, "total_steps": 2066, "loss": 1.0069, "lr": 1.2806701692507162e-06, "epoch": 1.345595353339787, "percentage": 67.28, "elapsed_time": "1:15:16", "remaining_time": "0:36:36"}
+{"current_steps": 1400, "total_steps": 2066, "loss": 0.9531, "lr": 1.2466074517288558e-06, "epoch": 1.3552758954501452, "percentage": 67.76, "elapsed_time": "1:15:49", "remaining_time": "0:36:04"}
+{"current_steps": 1400, "total_steps": 2066, "eval_loss": 1.3317842483520508, "epoch": 1.3552758954501452, "percentage": 67.76, "elapsed_time": "1:16:00", "remaining_time": "0:36:09"}
+{"current_steps": 1410, "total_steps": 2066, "loss": 0.991, "lr": 1.212852756766399e-06, "epoch": 1.3649564375605034, "percentage": 68.25, "elapsed_time": "1:17:41", "remaining_time": "0:36:08"}
+{"current_steps": 1420, "total_steps": 2066, "loss": 0.957, "lr": 1.1794143796157358e-06, "epoch": 1.3746369796708615, "percentage": 68.73, "elapsed_time": "1:18:14", "remaining_time": "0:35:35"}
+{"current_steps": 1430, "total_steps": 2066, "loss": 0.9406, "lr": 1.1463005377938182e-06, "epoch": 1.3843175217812198, "percentage": 69.22, "elapsed_time": "1:18:50", "remaining_time": "0:35:04"}
+{"current_steps": 1440, "total_steps": 2066, "loss": 0.958, "lr": 1.1135193690626926e-06, "epoch": 1.3939980638915779, "percentage": 69.7, "elapsed_time": "1:19:27", "remaining_time": "0:34:32"}
+{"current_steps": 1450, "total_steps": 2066, "loss": 1.0262, "lr": 1.0810789294296397e-06, "epoch": 1.4036786060019362, "percentage": 70.18, "elapsed_time": "1:20:02", "remaining_time": "0:34:00"}
+{"current_steps": 1460, "total_steps": 2066, "loss": 0.9745, "lr": 1.048987191167398e-06, "epoch": 1.4133591481122942, "percentage": 70.67, "elapsed_time": "1:20:35", "remaining_time": "0:33:26"}
+{"current_steps": 1470, "total_steps": 2066, "loss": 0.9759, "lr": 1.0172520408549716e-06, "epoch": 1.4230396902226525, "percentage": 71.15, "elapsed_time": "1:21:11", "remaining_time": "0:32:54"}
+{"current_steps": 1480, "total_steps": 2066, "loss": 1.0117, "lr": 9.858812774394946e-07, "epoch": 1.4327202323330106, "percentage": 71.64, "elapsed_time": "1:21:43", "remaining_time": "0:32:21"}
+{"current_steps": 1490, "total_steps": 2066, "loss": 0.9736, "lr": 9.548826103196304e-07, "epoch": 1.442400774443369, "percentage": 72.12, "elapsed_time": "1:22:16", "remaining_time": "0:31:48"}
+{"current_steps": 1500, "total_steps": 2066, "loss": 1.002, "lr": 9.242636574509828e-07, "epoch": 1.452081316553727, "percentage": 72.6, "elapsed_time": "1:22:50", "remaining_time": "0:31:15"}
+{"current_steps": 1510, "total_steps": 2066, "loss": 1.0391, "lr": 8.940319434739683e-07, "epoch": 1.4617618586640853, "percentage": 73.09, "elapsed_time": "1:24:32", "remaining_time": "0:31:07"}
+{"current_steps": 1520, "total_steps": 2066, "loss": 0.9864, "lr": 8.641948978646361e-07, "epoch": 1.4714424007744433, "percentage": 73.57, "elapsed_time": "1:25:10", "remaining_time": "0:30:35"}
+{"current_steps": 1530, "total_steps": 2066, "loss": 1.0425, "lr": 8.347598531088555e-07, "epoch": 1.4811229428848016, "percentage": 74.06, "elapsed_time": "1:25:45", "remaining_time": "0:30:02"}
+{"current_steps": 1540, "total_steps": 2066, "loss": 1.0028, "lr": 8.05734042900363e-07, "epoch": 1.4908034849951597, "percentage": 74.54, "elapsed_time": "1:26:18", "remaining_time": "0:29:28"}
+{"current_steps": 1550, "total_steps": 2066, "loss": 0.9764, "lr": 7.771246003630625e-07, "epoch": 1.500484027105518, "percentage": 75.02, "elapsed_time": "1:26:50", "remaining_time": "0:28:54"}
+{"current_steps": 1560, "total_steps": 2066, "loss": 0.9658, "lr": 7.489385562980589e-07, "epoch": 1.510164569215876, "percentage": 75.51, "elapsed_time": "1:27:25", "remaining_time": "0:28:21"}
+{"current_steps": 1570, "total_steps": 2066, "loss": 0.9621, "lr": 7.211828374558311e-07, "epoch": 1.5198451113262341, "percentage": 75.99, "elapsed_time": "1:27:56", "remaining_time": "0:27:47"}
+{"current_steps": 1580, "total_steps": 2066, "loss": 0.9874, "lr": 6.938642648339719e-07, "epoch": 1.5295256534365924, "percentage": 76.48, "elapsed_time": "1:28:29", "remaining_time": "0:27:13"}
+{"current_steps": 1590, "total_steps": 2066, "loss": 0.9481, "lr": 6.669895520009239e-07, "epoch": 1.5392061955469507, "percentage": 76.96, "elapsed_time": "1:29:02", "remaining_time": "0:26:39"}
+{"current_steps": 1600, "total_steps": 2066, "loss": 0.9555, "lr": 6.405653034461115e-07, "epoch": 1.5488867376573088, "percentage": 77.44, "elapsed_time": "1:29:36", "remaining_time": "0:26:06"}
+{"current_steps": 1600, "total_steps": 2066, "eval_loss": 1.3306459188461304, "epoch": 1.5488867376573088, "percentage": 77.44, "elapsed_time": "1:29:47", "remaining_time": "0:26:09"}
+{"current_steps": 1610, "total_steps": 2066, "loss": 1.0002, "lr": 6.145980129568823e-07, "epoch": 1.5585672797676668, "percentage": 77.93, "elapsed_time": "1:31:24", "remaining_time": "0:25:53"}
+{"current_steps": 1620, "total_steps": 2066, "loss": 1.0028, "lr": 5.890940620226479e-07, "epoch": 1.5682478218780251, "percentage": 78.41, "elapsed_time": "1:31:57", "remaining_time": "0:25:19"}
+{"current_steps": 1630, "total_steps": 2066, "loss": 0.9734, "lr": 5.640597182666324e-07, "epoch": 1.5779283639883834, "percentage": 78.9, "elapsed_time": "1:32:29", "remaining_time": "0:24:44"}
+{"current_steps": 1640, "total_steps": 2066, "loss": 0.976, "lr": 5.395011339055886e-07, "epoch": 1.5876089060987415, "percentage": 79.38, "elapsed_time": "1:33:05", "remaining_time": "0:24:10"}
+{"current_steps": 1650, "total_steps": 2066, "loss": 0.9662, "lr": 5.154243442378934e-07, "epoch": 1.5972894482090996, "percentage": 79.86, "elapsed_time": "1:33:38", "remaining_time": "0:23:36"}
+{"current_steps": 1660, "total_steps": 2066, "loss": 1.0096, "lr": 4.918352661603604e-07, "epoch": 1.6069699903194579, "percentage": 80.35, "elapsed_time": "1:34:15", "remaining_time": "0:23:03"}
+{"current_steps": 1670, "total_steps": 2066, "loss": 0.9944, "lr": 4.687396967141583e-07, "epoch": 1.6166505324298162, "percentage": 80.83, "elapsed_time": "1:34:52", "remaining_time": "0:22:29"}
+{"current_steps": 1680, "total_steps": 2066, "loss": 0.9976, "lr": 4.4614331166018403e-07, "epoch": 1.6263310745401742, "percentage": 81.32, "elapsed_time": "1:35:28", "remaining_time": "0:21:56"}
+{"current_steps": 1690, "total_steps": 2066, "loss": 0.9404, "lr": 4.2405166408423154e-07, "epoch": 1.6360116166505323, "percentage": 81.8, "elapsed_time": "1:36:00", "remaining_time": "0:21:21"}
+{"current_steps": 1700, "total_steps": 2066, "loss": 0.9871, "lr": 4.0247018303232437e-07, "epoch": 1.6456921587608906, "percentage": 82.28, "elapsed_time": "1:36:35", "remaining_time": "0:20:47"}
+{"current_steps": 1710, "total_steps": 2066, "loss": 0.9848, "lr": 3.8140417217651437e-07, "epoch": 1.6553727008712489, "percentage": 82.77, "elapsed_time": "1:38:17", "remaining_time": "0:20:27"}
+{"current_steps": 1720, "total_steps": 2066, "loss": 1.0306, "lr": 3.608588085115028e-07, "epoch": 1.665053242981607, "percentage": 83.25, "elapsed_time": "1:39:07", "remaining_time": "0:19:56"}
+{"current_steps": 1730, "total_steps": 2066, "loss": 0.9356, "lr": 3.408391410823864e-07, "epoch": 1.674733785091965, "percentage": 83.74, "elapsed_time": "1:39:51", "remaining_time": "0:19:23"}
+{"current_steps": 1740, "total_steps": 2066, "loss": 0.9694, "lr": 3.213500897438487e-07, "epoch": 1.6844143272023233, "percentage": 84.22, "elapsed_time": "1:40:24", "remaining_time": "0:18:48"}
+{"current_steps": 1750, "total_steps": 2066, "loss": 0.9816, "lr": 3.023964439511026e-07, "epoch": 1.6940948693126816, "percentage": 84.7, "elapsed_time": "1:41:02", "remaining_time": "0:18:14"}
+{"current_steps": 1760, "total_steps": 2066, "loss": 1.0213, "lr": 2.839828615828744e-07, "epoch": 1.7037754114230397, "percentage": 85.19, "elapsed_time": "1:41:42", "remaining_time": "0:17:41"}
+{"current_steps": 1770, "total_steps": 2066, "loss": 0.9737, "lr": 2.6611386779672786e-07, "epoch": 1.7134559535333977, "percentage": 85.67, "elapsed_time": "1:42:28", "remaining_time": "0:17:08"}
+{"current_steps": 1780, "total_steps": 2066, "loss": 0.9567, "lr": 2.487938539169982e-07, "epoch": 1.723136495643756, "percentage": 86.16, "elapsed_time": "1:43:02", "remaining_time": "0:16:33"}
+{"current_steps": 1790, "total_steps": 2066, "loss": 0.9943, "lr": 2.3202707635562371e-07, "epoch": 1.7328170377541143, "percentage": 86.64, "elapsed_time": "1:43:39", "remaining_time": "0:15:58"}
+{"current_steps": 1800, "total_steps": 2066, "loss": 0.9866, "lr": 2.1581765556612233e-07, "epoch": 1.7424975798644724, "percentage": 87.12, "elapsed_time": "1:44:12", "remaining_time": "0:15:24"}
+{"current_steps": 1800, "total_steps": 2066, "eval_loss": 1.3297312259674072, "epoch": 1.7424975798644724, "percentage": 87.12, "elapsed_time": "1:44:25", "remaining_time": "0:15:25"}
+{"current_steps": 1810, "total_steps": 2066, "loss": 0.9948, "lr": 2.001695750309926e-07, "epoch": 1.7521781219748305, "percentage": 87.61, "elapsed_time": "1:46:02", "remaining_time": "0:14:59"}
+{"current_steps": 1820, "total_steps": 2066, "loss": 0.9888, "lr": 1.8508668028276305e-07, "epoch": 1.7618586640851888, "percentage": 88.09, "elapsed_time": "1:46:37", "remaining_time": "0:14:24"}
+{"current_steps": 1830, "total_steps": 2066, "loss": 1.0625, "lr": 1.7057267795895117e-07, "epoch": 1.771539206195547, "percentage": 88.58, "elapsed_time": "1:47:14", "remaining_time": "0:13:49"}
+{"current_steps": 1840, "total_steps": 2066, "loss": 0.9834, "lr": 1.566311348911534e-07, "epoch": 1.7812197483059051, "percentage": 89.06, "elapsed_time": "1:47:47", "remaining_time": "0:13:14"}
+{"current_steps": 1850, "total_steps": 2066, "loss": 0.9507, "lr": 1.4326547722848972e-07, "epoch": 1.7909002904162632, "percentage": 89.55, "elapsed_time": "1:48:19", "remaining_time": "0:12:38"}
+{"current_steps": 1860, "total_steps": 2066, "loss": 0.9997, "lr": 1.3047898959562767e-07, "epoch": 1.8005808325266215, "percentage": 90.03, "elapsed_time": "1:48:56", "remaining_time": "0:12:03"}
+{"current_steps": 1870, "total_steps": 2066, "loss": 0.9919, "lr": 1.1827481428557969e-07, "epoch": 1.8102613746369798, "percentage": 90.51, "elapsed_time": "1:49:47", "remaining_time": "0:11:30"}
+{"current_steps": 1880, "total_steps": 2066, "loss": 0.9774, "lr": 1.0665595048748257e-07, "epoch": 1.8199419167473379, "percentage": 91.0, "elapsed_time": "1:50:29", "remaining_time": "0:10:55"}
+{"current_steps": 1890, "total_steps": 2066, "loss": 1.0053, "lr": 9.562525354954194e-08, "epoch": 1.829622458857696, "percentage": 91.48, "elapsed_time": "1:51:11", "remaining_time": "0:10:21"}
+{"current_steps": 1900, "total_steps": 2066, "loss": 0.9824, "lr": 8.518543427732951e-08, "epoch": 1.8393030009680542, "percentage": 91.97, "elapsed_time": "1:51:47", "remaining_time": "0:09:46"}
+{"current_steps": 1910, "total_steps": 2066, "loss": 0.9876, "lr": 7.53390582675978e-08, "epoch": 1.8489835430784125, "percentage": 92.45, "elapsed_time": "1:53:24", "remaining_time": "0:09:15"}
+{"current_steps": 1920, "total_steps": 2066, "loss": 0.9587, "lr": 6.608854527778319e-08, "epoch": 1.8586640851887706, "percentage": 92.93, "elapsed_time": "1:54:05", "remaining_time": "0:08:40"}
+{"current_steps": 1930, "total_steps": 2066, "loss": 0.9771, "lr": 5.743616863134793e-08, "epoch": 1.8683446272991286, "percentage": 93.42, "elapsed_time": "1:54:39", "remaining_time": "0:08:04"}
+{"current_steps": 1940, "total_steps": 2066, "loss": 0.9591, "lr": 4.938405465910706e-08, "epoch": 1.878025169409487, "percentage": 93.9, "elapsed_time": "1:55:09", "remaining_time": "0:07:28"}
+{"current_steps": 1950, "total_steps": 2066, "loss": 0.9684, "lr": 4.193418217668305e-08, "epoch": 1.8877057115198452, "percentage": 94.39, "elapsed_time": "1:55:42", "remaining_time": "0:06:52"}
+{"current_steps": 1960, "total_steps": 2066, "loss": 0.9637, "lr": 3.508838199820591e-08, "epoch": 1.8973862536302033, "percentage": 94.87, "elapsed_time": "1:56:14", "remaining_time": "0:06:17"}
+{"current_steps": 1970, "total_steps": 2066, "loss": 1.0192, "lr": 2.884833648639257e-08, "epoch": 1.9070667957405614, "percentage": 95.35, "elapsed_time": "1:56:48", "remaining_time": "0:05:41"}
+{"current_steps": 1980, "total_steps": 2066, "loss": 0.9587, "lr": 2.3215579139101996e-08, "epoch": 1.9167473378509197, "percentage": 95.84, "elapsed_time": "1:57:20", "remaining_time": "0:05:05"}
+{"current_steps": 1990, "total_steps": 2066, "loss": 0.9727, "lr": 1.8191494212477513e-08, "epoch": 1.926427879961278, "percentage": 96.32, "elapsed_time": "1:57:53", "remaining_time": "0:04:30"}
+{"current_steps": 2000, "total_steps": 2066, "loss": 0.9891, "lr": 1.3777316380763073e-08, "epoch": 1.936108422071636, "percentage": 96.81, "elapsed_time": "1:58:27", "remaining_time": "0:03:54"}
+{"current_steps": 2000, "total_steps": 2066, "eval_loss": 1.328829288482666, "epoch": 1.936108422071636, "percentage": 96.81, "elapsed_time": "1:58:38", "remaining_time": "0:03:54"}
+{"current_steps": 2010, "total_steps": 2066, "loss": 0.9352, "lr": 9.9741304328832e-09, "epoch": 1.945788964181994, "percentage": 97.29, "elapsed_time": "2:00:15", "remaining_time": "0:03:21"}
+{"current_steps": 2020, "total_steps": 2066, "loss": 0.9591, "lr": 6.782871005851788e-09, "epoch": 1.9554695062923524, "percentage": 97.77, "elapsed_time": "2:00:49", "remaining_time": "0:02:45"}
+{"current_steps": 2030, "total_steps": 2066, "loss": 0.9201, "lr": 4.2043223550869425e-09, "epoch": 1.9651500484027107, "percentage": 98.26, "elapsed_time": "2:01:20", "remaining_time": "0:02:09"}
+{"current_steps": 2040, "total_steps": 2066, "loss": 0.9517, "lr": 2.239118161677656e-09, "epoch": 1.9748305905130688, "percentage": 98.74, "elapsed_time": "2:01:53", "remaining_time": "0:01:33"}
+{"current_steps": 2050, "total_steps": 2066, "loss": 0.9299, "lr": 8.877413766561482e-10, "epoch": 1.9845111326234268, "percentage": 99.23, "elapsed_time": "2:02:31", "remaining_time": "0:00:57"}
+{"current_steps": 2060, "total_steps": 2066, "loss": 0.9732, "lr": 1.5052410231336522e-10, "epoch": 1.9941916747337851, "percentage": 99.71, "elapsed_time": "2:03:11", "remaining_time": "0:00:21"}
+{"current_steps": 2066, "total_steps": 2066, "epoch": 2.0, "percentage": 100.0, "elapsed_time": "2:04:34", "remaining_time": "0:00:00"}

trainer_state.json ADDED Viewed

	@@ -0,0 +1,1565 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 2.0,
+  "eval_steps": 200,
+  "global_step": 2066,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.00968054211035818,
+      "grad_norm": 4.820261516808518,
+      "learning_rate": 7.258064516129033e-07,
+      "loss": 1.7236,
+      "step": 10
+    },
+    {
+      "epoch": 0.01936108422071636,
+      "grad_norm": 3.626062262613067,
+      "learning_rate": 1.5322580645161292e-06,
+      "loss": 1.627,
+      "step": 20
+    },
+    {
+      "epoch": 0.02904162633107454,
+      "grad_norm": 2.5817816313172064,
+      "learning_rate": 2.338709677419355e-06,
+      "loss": 1.4318,
+      "step": 30
+    },
+    {
+      "epoch": 0.03872216844143272,
+      "grad_norm": 2.120575175645984,
+      "learning_rate": 3.145161290322581e-06,
+      "loss": 1.3885,
+      "step": 40
+    },
+    {
+      "epoch": 0.0484027105517909,
+      "grad_norm": 2.3529433456721085,
+      "learning_rate": 3.951612903225807e-06,
+      "loss": 1.3955,
+      "step": 50
+    },
+    {
+      "epoch": 0.05808325266214908,
+      "grad_norm": 2.0836995501157407,
+      "learning_rate": 4.758064516129033e-06,
+      "loss": 1.2718,
+      "step": 60
+    },
+    {
+      "epoch": 0.06776379477250725,
+      "grad_norm": 2.2076440974932288,
+      "learning_rate": 4.999849475897687e-06,
+      "loss": 1.3654,
+      "step": 70
+    },
+    {
+      "epoch": 0.07744433688286544,
+      "grad_norm": 2.0109935662674743,
+      "learning_rate": 4.999112258623345e-06,
+      "loss": 1.2831,
+      "step": 80
+    },
+    {
+      "epoch": 0.08712487899322362,
+      "grad_norm": 2.137200234785965,
+      "learning_rate": 4.997760881838323e-06,
+      "loss": 1.3002,
+      "step": 90
+    },
+    {
+      "epoch": 0.0968054211035818,
+      "grad_norm": 2.0451930107997325,
+      "learning_rate": 4.995795677644913e-06,
+      "loss": 1.287,
+      "step": 100
+    },
+    {
+      "epoch": 0.10648596321393998,
+      "grad_norm": 2.018997434809762,
+      "learning_rate": 4.993217128994149e-06,
+      "loss": 1.2492,
+      "step": 110
+    },
+    {
+      "epoch": 0.11616650532429816,
+      "grad_norm": 2.08777196120052,
+      "learning_rate": 4.9900258695671176e-06,
+      "loss": 1.2794,
+      "step": 120
+    },
+    {
+      "epoch": 0.12584704743465633,
+      "grad_norm": 2.224636401575338,
+      "learning_rate": 4.986222683619237e-06,
+      "loss": 1.2506,
+      "step": 130
+    },
+    {
+      "epoch": 0.1355275895450145,
+      "grad_norm": 1.7965192828814696,
+      "learning_rate": 4.981808505787523e-06,
+      "loss": 1.2609,
+      "step": 140
+    },
+    {
+      "epoch": 0.1452081316553727,
+      "grad_norm": 1.8146840191247664,
+      "learning_rate": 4.976784420860898e-06,
+      "loss": 1.2329,
+      "step": 150
+    },
+    {
+      "epoch": 0.15488867376573087,
+      "grad_norm": 2.1056167196365423,
+      "learning_rate": 4.971151663513608e-06,
+      "loss": 1.3551,
+      "step": 160
+    },
+    {
+      "epoch": 0.16456921587608905,
+      "grad_norm": 1.969170099552798,
+      "learning_rate": 4.964911618001794e-06,
+      "loss": 1.261,
+      "step": 170
+    },
+    {
+      "epoch": 0.17424975798644723,
+      "grad_norm": 1.7282575615370996,
+      "learning_rate": 4.958065817823318e-06,
+      "loss": 1.2055,
+      "step": 180
+    },
+    {
+      "epoch": 0.18393030009680542,
+      "grad_norm": 2.263093065284853,
+      "learning_rate": 4.950615945340893e-06,
+      "loss": 1.3022,
+      "step": 190
+    },
+    {
+      "epoch": 0.1936108422071636,
+      "grad_norm": 1.947338641667041,
+      "learning_rate": 4.942563831368653e-06,
+      "loss": 1.2701,
+      "step": 200
+    },
+    {
+      "epoch": 0.1936108422071636,
+      "eval_loss": 1.3192224502563477,
+      "eval_runtime": 11.1204,
+      "eval_samples_per_second": 60.07,
+      "eval_steps_per_second": 3.777,
+      "step": 200
+    },
+    {
+      "epoch": 0.20329138431752178,
+      "grad_norm": 2.3763373561887176,
+      "learning_rate": 4.933911454722217e-06,
+      "loss": 1.277,
+      "step": 210
+    },
+    {
+      "epoch": 0.21297192642787996,
+      "grad_norm": 1.8476884141723375,
+      "learning_rate": 4.924660941732403e-06,
+      "loss": 1.2418,
+      "step": 220
+    },
+    {
+      "epoch": 0.22265246853823814,
+      "grad_norm": 2.088709216532341,
+      "learning_rate": 4.914814565722671e-06,
+      "loss": 1.294,
+      "step": 230
+    },
+    {
+      "epoch": 0.23233301064859632,
+      "grad_norm": 1.9726270845953118,
+      "learning_rate": 4.9043747464504586e-06,
+      "loss": 1.2823,
+      "step": 240
+    },
+    {
+      "epoch": 0.2420135527589545,
+      "grad_norm": 2.153285232108024,
+      "learning_rate": 4.893344049512519e-06,
+      "loss": 1.2753,
+      "step": 250
+    },
+    {
+      "epoch": 0.25169409486931266,
+      "grad_norm": 1.698193104908672,
+      "learning_rate": 4.881725185714421e-06,
+      "loss": 1.1851,
+      "step": 260
+    },
+    {
+      "epoch": 0.26137463697967084,
+      "grad_norm": 2.2926805502010166,
+      "learning_rate": 4.869521010404373e-06,
+      "loss": 1.2901,
+      "step": 270
+    },
+    {
+      "epoch": 0.271055179090029,
+      "grad_norm": 1.93897800592033,
+      "learning_rate": 4.856734522771512e-06,
+      "loss": 1.246,
+      "step": 280
+    },
+    {
+      "epoch": 0.2807357212003872,
+      "grad_norm": 1.9602823265085942,
+      "learning_rate": 4.843368865108847e-06,
+      "loss": 1.204,
+      "step": 290
+    },
+    {
+      "epoch": 0.2904162633107454,
+      "grad_norm": 2.002239339436152,
+      "learning_rate": 4.8294273220410494e-06,
+      "loss": 1.271,
+      "step": 300
+    },
+    {
+      "epoch": 0.30009680542110356,
+      "grad_norm": 1.9017334477170922,
+      "learning_rate": 4.814913319717238e-06,
+      "loss": 1.307,
+      "step": 310
+    },
+    {
+      "epoch": 0.30977734753146174,
+      "grad_norm": 2.042693902021796,
+      "learning_rate": 4.799830424969008e-06,
+      "loss": 1.273,
+      "step": 320
+    },
+    {
+      "epoch": 0.3194578896418199,
+      "grad_norm": 1.948504976997528,
+      "learning_rate": 4.784182344433878e-06,
+      "loss": 1.2719,
+      "step": 330
+    },
+    {
+      "epoch": 0.3291384317521781,
+      "grad_norm": 2.083841470170843,
+      "learning_rate": 4.767972923644377e-06,
+      "loss": 1.2732,
+      "step": 340
+    },
+    {
+      "epoch": 0.3388189738625363,
+      "grad_norm": 2.1919169498570055,
+      "learning_rate": 4.751206146083002e-06,
+      "loss": 1.3289,
+      "step": 350
+    },
+    {
+      "epoch": 0.34849951597289447,
+      "grad_norm": 1.9533269346119766,
+      "learning_rate": 4.7338861322032724e-06,
+      "loss": 1.2303,
+      "step": 360
+    },
+    {
+      "epoch": 0.35818005808325265,
+      "grad_norm": 1.9672032803697441,
+      "learning_rate": 4.716017138417126e-06,
+      "loss": 1.1788,
+      "step": 370
+    },
+    {
+      "epoch": 0.36786060019361083,
+      "grad_norm": 1.7493745751363814,
+      "learning_rate": 4.697603556048899e-06,
+      "loss": 1.2543,
+      "step": 380
+    },
+    {
+      "epoch": 0.377541142303969,
+      "grad_norm": 1.8808999956966037,
+      "learning_rate": 4.6786499102561525e-06,
+      "loss": 1.3091,
+      "step": 390
+    },
+    {
+      "epoch": 0.3872216844143272,
+      "grad_norm": 1.8576232558705237,
+      "learning_rate": 4.659160858917614e-06,
+      "loss": 1.2693,
+      "step": 400
+    },
+    {
+      "epoch": 0.3872216844143272,
+      "eval_loss": 1.3101810216903687,
+      "eval_runtime": 11.0364,
+      "eval_samples_per_second": 60.527,
+      "eval_steps_per_second": 3.806,
+      "step": 400
+    },
+    {
+      "epoch": 0.3969022265246854,
+      "grad_norm": 2.169929911455977,
+      "learning_rate": 4.639141191488498e-06,
+      "loss": 1.2866,
+      "step": 410
+    },
+    {
+      "epoch": 0.40658276863504356,
+      "grad_norm": 1.903451878169039,
+      "learning_rate": 4.618595827823486e-06,
+      "loss": 1.3088,
+      "step": 420
+    },
+    {
+      "epoch": 0.41626331074540174,
+      "grad_norm": 1.8380999378945895,
+      "learning_rate": 4.597529816967676e-06,
+      "loss": 1.2445,
+      "step": 430
+    },
+    {
+      "epoch": 0.4259438528557599,
+      "grad_norm": 1.794994240260647,
+      "learning_rate": 4.575948335915769e-06,
+      "loss": 1.2679,
+      "step": 440
+    },
+    {
+      "epoch": 0.4356243949661181,
+      "grad_norm": 1.8131554887283838,
+      "learning_rate": 4.553856688339817e-06,
+      "loss": 1.2699,
+      "step": 450
+    },
+    {
+      "epoch": 0.4453049370764763,
+      "grad_norm": 1.822749339089404,
+      "learning_rate": 4.531260303285841e-06,
+      "loss": 1.2381,
+      "step": 460
+    },
+    {
+      "epoch": 0.45498547918683446,
+      "grad_norm": 1.7488634444209434,
+      "learning_rate": 4.50816473383964e-06,
+      "loss": 1.3089,
+      "step": 470
+    },
+    {
+      "epoch": 0.46466602129719264,
+      "grad_norm": 1.8572372415183604,
+      "learning_rate": 4.484575655762107e-06,
+      "loss": 1.2271,
+      "step": 480
+    },
+    {
+      "epoch": 0.4743465634075508,
+      "grad_norm": 2.072794841066779,
+      "learning_rate": 4.460498866094412e-06,
+      "loss": 1.2136,
+      "step": 490
+    },
+    {
+      "epoch": 0.484027105517909,
+      "grad_norm": 1.7764723909580904,
+      "learning_rate": 4.435940281733369e-06,
+      "loss": 1.2747,
+      "step": 500
+    },
+    {
+      "epoch": 0.4937076476282672,
+      "grad_norm": 2.0917776639027914,
+      "learning_rate": 4.410905937977353e-06,
+      "loss": 1.265,
+      "step": 510
+    },
+    {
+      "epoch": 0.5033881897386253,
+      "grad_norm": 2.0260759502040786,
+      "learning_rate": 4.385401987043118e-06,
+      "loss": 1.2895,
+      "step": 520
+    },
+    {
+      "epoch": 0.5130687318489835,
+      "grad_norm": 1.9734465330232784,
+      "learning_rate": 4.359434696553889e-06,
+      "loss": 1.2376,
+      "step": 530
+    },
+    {
+      "epoch": 0.5227492739593417,
+      "grad_norm": 1.7584757436973566,
+      "learning_rate": 4.333010447999077e-06,
+      "loss": 1.2575,
+      "step": 540
+    },
+    {
+      "epoch": 0.5324298160696999,
+      "grad_norm": 1.9187648320497177,
+      "learning_rate": 4.3061357351660285e-06,
+      "loss": 1.267,
+      "step": 550
+    },
+    {
+      "epoch": 0.542110358180058,
+      "grad_norm": 1.7622412801152667,
+      "learning_rate": 4.27881716254417e-06,
+      "loss": 1.2584,
+      "step": 560
+    },
+    {
+      "epoch": 0.5517909002904162,
+      "grad_norm": 1.9661817706505653,
+      "learning_rate": 4.251061443701941e-06,
+      "loss": 1.2263,
+      "step": 570
+    },
+    {
+      "epoch": 0.5614714424007744,
+      "grad_norm": 1.8276707656109445,
+      "learning_rate": 4.222875399636938e-06,
+      "loss": 1.2231,
+      "step": 580
+    },
+    {
+      "epoch": 0.5711519845111326,
+      "grad_norm": 2.04251694462628,
+      "learning_rate": 4.194265957099638e-06,
+      "loss": 1.2656,
+      "step": 590
+    },
+    {
+      "epoch": 0.5808325266214908,
+      "grad_norm": 1.6830527245888396,
+      "learning_rate": 4.165240146891145e-06,
+      "loss": 1.2341,
+      "step": 600
+    },
+    {
+      "epoch": 0.5808325266214908,
+      "eval_loss": 1.3036646842956543,
+      "eval_runtime": 11.0157,
+      "eval_samples_per_second": 60.641,
+      "eval_steps_per_second": 3.813,
+      "step": 600
+    },
+    {
+      "epoch": 0.590513068731849,
+      "grad_norm": 1.8700224183798644,
+      "learning_rate": 4.1358051021353655e-06,
+      "loss": 1.2413,
+      "step": 610
+    },
+    {
+      "epoch": 0.6001936108422071,
+      "grad_norm": 2.025528365164639,
+      "learning_rate": 4.1059680565260315e-06,
+      "loss": 1.2342,
+      "step": 620
+    },
+    {
+      "epoch": 0.6098741529525653,
+      "grad_norm": 1.7543494271951428,
+      "learning_rate": 4.0757363425490185e-06,
+      "loss": 1.1899,
+      "step": 630
+    },
+    {
+      "epoch": 0.6195546950629235,
+      "grad_norm": 1.8857378829308964,
+      "learning_rate": 4.04511738968037e-06,
+      "loss": 1.1912,
+      "step": 640
+    },
+    {
+      "epoch": 0.6292352371732817,
+      "grad_norm": 1.8763965346178748,
+      "learning_rate": 4.0141187225605064e-06,
+      "loss": 1.2066,
+      "step": 650
+    },
+    {
+      "epoch": 0.6389157792836399,
+      "grad_norm": 1.7240907552108375,
+      "learning_rate": 3.98274795914503e-06,
+      "loss": 1.2394,
+      "step": 660
+    },
+    {
+      "epoch": 0.648596321393998,
+      "grad_norm": 1.7995970571853088,
+      "learning_rate": 3.951012808832603e-06,
+      "loss": 1.2069,
+      "step": 670
+    },
+    {
+      "epoch": 0.6582768635043562,
+      "grad_norm": 2.0416216909511355,
+      "learning_rate": 3.918921070570361e-06,
+      "loss": 1.2724,
+      "step": 680
+    },
+    {
+      "epoch": 0.6679574056147144,
+      "grad_norm": 1.8831005612989327,
+      "learning_rate": 3.886480630937307e-06,
+      "loss": 1.3105,
+      "step": 690
+    },
+    {
+      "epoch": 0.6776379477250726,
+      "grad_norm": 1.8563831558700061,
+      "learning_rate": 3.853699462206183e-06,
+      "loss": 1.1989,
+      "step": 700
+    },
+    {
+      "epoch": 0.6873184898354308,
+      "grad_norm": 1.642368767478371,
+      "learning_rate": 3.820585620384265e-06,
+      "loss": 1.3256,
+      "step": 710
+    },
+    {
+      "epoch": 0.6969990319457889,
+      "grad_norm": 1.8190881361027762,
+      "learning_rate": 3.787147243233602e-06,
+      "loss": 1.2206,
+      "step": 720
+    },
+    {
+      "epoch": 0.7066795740561471,
+      "grad_norm": 1.9867148410918392,
+      "learning_rate": 3.753392548271144e-06,
+      "loss": 1.2245,
+      "step": 730
+    },
+    {
+      "epoch": 0.7163601161665053,
+      "grad_norm": 1.6456620597756189,
+      "learning_rate": 3.7193298307492855e-06,
+      "loss": 1.2685,
+      "step": 740
+    },
+    {
+      "epoch": 0.7260406582768635,
+      "grad_norm": 1.7765857550257254,
+      "learning_rate": 3.6849674616172887e-06,
+      "loss": 1.2379,
+      "step": 750
+    },
+    {
+      "epoch": 0.7357212003872217,
+      "grad_norm": 1.7474726903371232,
+      "learning_rate": 3.6503138854641257e-06,
+      "loss": 1.2176,
+      "step": 760
+    },
+    {
+      "epoch": 0.7454017424975798,
+      "grad_norm": 1.6559737928060483,
+      "learning_rate": 3.615377618443201e-06,
+      "loss": 1.2751,
+      "step": 770
+    },
+    {
+      "epoch": 0.755082284607938,
+      "grad_norm": 1.6608737876417763,
+      "learning_rate": 3.5801672461795032e-06,
+      "loss": 1.2335,
+      "step": 780
+    },
+    {
+      "epoch": 0.7647628267182962,
+      "grad_norm": 1.777841876809436,
+      "learning_rate": 3.5446914216596805e-06,
+      "loss": 1.2816,
+      "step": 790
+    },
+    {
+      "epoch": 0.7744433688286544,
+      "grad_norm": 1.7176497633610424,
+      "learning_rate": 3.5089588631055527e-06,
+      "loss": 1.1997,
+      "step": 800
+    },
+    {
+      "epoch": 0.7744433688286544,
+      "eval_loss": 1.2973461151123047,
+      "eval_runtime": 11.1321,
+      "eval_samples_per_second": 60.006,
+      "eval_steps_per_second": 3.773,
+      "step": 800
+    },
+    {
+      "epoch": 0.7841239109390126,
+      "grad_norm": 1.68764869861233,
+      "learning_rate": 3.472978351831606e-06,
+      "loss": 1.2153,
+      "step": 810
+    },
+    {
+      "epoch": 0.7938044530493708,
+      "grad_norm": 1.8236059508367688,
+      "learning_rate": 3.436758730086971e-06,
+      "loss": 1.1981,
+      "step": 820
+    },
+    {
+      "epoch": 0.8034849951597289,
+      "grad_norm": 2.095607905740359,
+      "learning_rate": 3.4003088988824323e-06,
+      "loss": 1.2271,
+      "step": 830
+    },
+    {
+      "epoch": 0.8131655372700871,
+      "grad_norm": 2.2791939009599282,
+      "learning_rate": 3.363637815802998e-06,
+      "loss": 1.2394,
+      "step": 840
+    },
+    {
+      "epoch": 0.8228460793804453,
+      "grad_norm": 1.6243179467285545,
+      "learning_rate": 3.326754492806559e-06,
+      "loss": 1.2334,
+      "step": 850
+    },
+    {
+      "epoch": 0.8325266214908035,
+      "grad_norm": 2.031052243391612,
+      "learning_rate": 3.2896679940091913e-06,
+      "loss": 1.2327,
+      "step": 860
+    },
+    {
+      "epoch": 0.8422071636011617,
+      "grad_norm": 1.8731727882620728,
+      "learning_rate": 3.2523874334576456e-06,
+      "loss": 1.2282,
+      "step": 870
+    },
+    {
+      "epoch": 0.8518877057115198,
+      "grad_norm": 1.673399951977698,
+      "learning_rate": 3.214921972889552e-06,
+      "loss": 1.2132,
+      "step": 880
+    },
+    {
+      "epoch": 0.861568247821878,
+      "grad_norm": 1.9311521456994778,
+      "learning_rate": 3.17728081948192e-06,
+      "loss": 1.2473,
+      "step": 890
+    },
+    {
+      "epoch": 0.8712487899322362,
+      "grad_norm": 1.6883274712781868,
+      "learning_rate": 3.139473223588462e-06,
+      "loss": 1.2524,
+      "step": 900
+    },
+    {
+      "epoch": 0.8809293320425944,
+      "grad_norm": 1.7885335433801643,
+      "learning_rate": 3.1015084764663074e-06,
+      "loss": 1.2423,
+      "step": 910
+    },
+    {
+      "epoch": 0.8906098741529526,
+      "grad_norm": 1.588598381552853,
+      "learning_rate": 3.063395907992671e-06,
+      "loss": 1.1997,
+      "step": 920
+    },
+    {
+      "epoch": 0.9002904162633107,
+      "grad_norm": 1.7465829948784894,
+      "learning_rate": 3.025144884372021e-06,
+      "loss": 1.2233,
+      "step": 930
+    },
+    {
+      "epoch": 0.9099709583736689,
+      "grad_norm": 1.7479955519439432,
+      "learning_rate": 2.9867648058343262e-06,
+      "loss": 1.2115,
+      "step": 940
+    },
+    {
+      "epoch": 0.9196515004840271,
+      "grad_norm": 1.838586531536656,
+      "learning_rate": 2.948265104324941e-06,
+      "loss": 1.2139,
+      "step": 950
+    },
+    {
+      "epoch": 0.9293320425943853,
+      "grad_norm": 1.8881700515006026,
+      "learning_rate": 2.9096552411866903e-06,
+      "loss": 1.2201,
+      "step": 960
+    },
+    {
+      "epoch": 0.9390125847047435,
+      "grad_norm": 1.9341900631906526,
+      "learning_rate": 2.8709447048347394e-06,
+      "loss": 1.1997,
+      "step": 970
+    },
+    {
+      "epoch": 0.9486931268151017,
+      "grad_norm": 1.689669805952995,
+      "learning_rate": 2.832143008424802e-06,
+      "loss": 1.2363,
+      "step": 980
+    },
+    {
+      "epoch": 0.9583736689254598,
+      "grad_norm": 1.7360587308811468,
+      "learning_rate": 2.7932596875152747e-06,
+      "loss": 1.2573,
+      "step": 990
+    },
+    {
+      "epoch": 0.968054211035818,
+      "grad_norm": 1.6379992314547125,
+      "learning_rate": 2.754304297723862e-06,
+      "loss": 1.2403,
+      "step": 1000
+    },
+    {
+      "epoch": 0.968054211035818,
+      "eval_loss": 1.2926961183547974,
+      "eval_runtime": 12.1297,
+      "eval_samples_per_second": 55.072,
+      "eval_steps_per_second": 3.463,
+      "step": 1000
+    },
+    {
+      "epoch": 0.9777347531461762,
+      "grad_norm": 1.8316472126902184,
+      "learning_rate": 2.7152864123792716e-06,
+      "loss": 1.2915,
+      "step": 1010
+    },
+    {
+      "epoch": 0.9874152952565344,
+      "grad_norm": 2.0299073379393096,
+      "learning_rate": 2.6762156201685627e-06,
+      "loss": 1.2246,
+      "step": 1020
+    },
+    {
+      "epoch": 0.9970958373668926,
+      "grad_norm": 1.6668629036408331,
+      "learning_rate": 2.6371015227807127e-06,
+      "loss": 1.302,
+      "step": 1030
+    },
+    {
+      "epoch": 1.0067763794772506,
+      "grad_norm": 1.8978062087217702,
+      "learning_rate": 2.5979537325469913e-06,
+      "loss": 1.1438,
+      "step": 1040
+    },
+    {
+      "epoch": 1.016456921587609,
+      "grad_norm": 1.994568701365586,
+      "learning_rate": 2.558781870078722e-06,
+      "loss": 0.9893,
+      "step": 1050
+    },
+    {
+      "epoch": 1.026137463697967,
+      "grad_norm": 1.9256869273008883,
+      "learning_rate": 2.5195955619030064e-06,
+      "loss": 0.9725,
+      "step": 1060
+    },
+    {
+      "epoch": 1.0358180058083253,
+      "grad_norm": 2.107489644464238,
+      "learning_rate": 2.480404438096994e-06,
+      "loss": 0.9776,
+      "step": 1070
+    },
+    {
+      "epoch": 1.0454985479186834,
+      "grad_norm": 1.749601621960194,
+      "learning_rate": 2.441218129921278e-06,
+      "loss": 1.0161,
+      "step": 1080
+    },
+    {
+      "epoch": 1.0551790900290416,
+      "grad_norm": 1.9337057731043839,
+      "learning_rate": 2.402046267453009e-06,
+      "loss": 1.0164,
+      "step": 1090
+    },
+    {
+      "epoch": 1.0648596321393997,
+      "grad_norm": 2.1084811781965307,
+      "learning_rate": 2.3628984772192885e-06,
+      "loss": 0.9799,
+      "step": 1100
+    },
+    {
+      "epoch": 1.074540174249758,
+      "grad_norm": 2.3394701146918218,
+      "learning_rate": 2.323784379831438e-06,
+      "loss": 0.9829,
+      "step": 1110
+    },
+    {
+      "epoch": 1.084220716360116,
+      "grad_norm": 2.192823910061485,
+      "learning_rate": 2.2847135876207292e-06,
+      "loss": 0.9397,
+      "step": 1120
+    },
+    {
+      "epoch": 1.0939012584704744,
+      "grad_norm": 1.8908009391071738,
+      "learning_rate": 2.245695702276139e-06,
+      "loss": 0.9544,
+      "step": 1130
+    },
+    {
+      "epoch": 1.1035818005808324,
+      "grad_norm": 2.056158442927219,
+      "learning_rate": 2.2067403124847257e-06,
+      "loss": 0.9867,
+      "step": 1140
+    },
+    {
+      "epoch": 1.1132623426911907,
+      "grad_norm": 1.8426000838714471,
+      "learning_rate": 2.167856991575199e-06,
+      "loss": 0.9843,
+      "step": 1150
+    },
+    {
+      "epoch": 1.1229428848015488,
+      "grad_norm": 1.8272505713437244,
+      "learning_rate": 2.1290552951652614e-06,
+      "loss": 0.9621,
+      "step": 1160
+    },
+    {
+      "epoch": 1.132623426911907,
+      "grad_norm": 1.8843537624286948,
+      "learning_rate": 2.09034475881331e-06,
+      "loss": 1.0003,
+      "step": 1170
+    },
+    {
+      "epoch": 1.1423039690222652,
+      "grad_norm": 1.7985029833743278,
+      "learning_rate": 2.0517348956750597e-06,
+      "loss": 0.9598,
+      "step": 1180
+    },
+    {
+      "epoch": 1.1519845111326235,
+      "grad_norm": 1.9352819350548416,
+      "learning_rate": 2.0132351941656737e-06,
+      "loss": 0.9328,
+      "step": 1190
+    },
+    {
+      "epoch": 1.1616650532429815,
+      "grad_norm": 2.053591523251749,
+      "learning_rate": 1.9748551156279803e-06,
+      "loss": 0.9994,
+      "step": 1200
+    },
+    {
+      "epoch": 1.1616650532429815,
+      "eval_loss": 1.3311784267425537,
+      "eval_runtime": 11.038,
+      "eval_samples_per_second": 60.518,
+      "eval_steps_per_second": 3.805,
+      "step": 1200
+    },
+    {
+      "epoch": 1.1713455953533398,
+      "grad_norm": 1.8508139660473053,
+      "learning_rate": 1.93660409200733e-06,
+      "loss": 1.0021,
+      "step": 1210
+    },
+    {
+      "epoch": 1.181026137463698,
+      "grad_norm": 1.7834354892337438,
+      "learning_rate": 1.8984915235336934e-06,
+      "loss": 1.0096,
+      "step": 1220
+    },
+    {
+      "epoch": 1.1907066795740562,
+      "grad_norm": 1.9980460190401137,
+      "learning_rate": 1.860526776411539e-06,
+      "loss": 1.0096,
+      "step": 1230
+    },
+    {
+      "epoch": 1.2003872216844143,
+      "grad_norm": 1.9068932720510319,
+      "learning_rate": 1.8227191805180806e-06,
+      "loss": 1.0768,
+      "step": 1240
+    },
+    {
+      "epoch": 1.2100677637947725,
+      "grad_norm": 1.912490766033442,
+      "learning_rate": 1.7850780271104483e-06,
+      "loss": 1.0072,
+      "step": 1250
+    },
+    {
+      "epoch": 1.2197483059051306,
+      "grad_norm": 2.185779735989334,
+      "learning_rate": 1.747612566542356e-06,
+      "loss": 0.9954,
+      "step": 1260
+    },
+    {
+      "epoch": 1.229428848015489,
+      "grad_norm": 2.1799620122111283,
+      "learning_rate": 1.7103320059908093e-06,
+      "loss": 0.9856,
+      "step": 1270
+    },
+    {
+      "epoch": 1.239109390125847,
+      "grad_norm": 1.754880227600898,
+      "learning_rate": 1.6732455071934424e-06,
+      "loss": 0.9882,
+      "step": 1280
+    },
+    {
+      "epoch": 1.2487899322362053,
+      "grad_norm": 2.099244672028386,
+      "learning_rate": 1.6363621841970022e-06,
+      "loss": 0.9218,
+      "step": 1290
+    },
+    {
+      "epoch": 1.2584704743465633,
+      "grad_norm": 2.120833202856271,
+      "learning_rate": 1.5996911011175675e-06,
+      "loss": 0.9803,
+      "step": 1300
+    },
+    {
+      "epoch": 1.2681510164569216,
+      "grad_norm": 1.9306623752452063,
+      "learning_rate": 1.5632412699130306e-06,
+      "loss": 0.9732,
+      "step": 1310
+    },
+    {
+      "epoch": 1.2778315585672797,
+      "grad_norm": 1.929234322625883,
+      "learning_rate": 1.5270216481683954e-06,
+      "loss": 0.9656,
+      "step": 1320
+    },
+    {
+      "epoch": 1.287512100677638,
+      "grad_norm": 1.7522945520010809,
+      "learning_rate": 1.4910411368944483e-06,
+      "loss": 0.968,
+      "step": 1330
+    },
+    {
+      "epoch": 1.297192642787996,
+      "grad_norm": 2.0785744202804843,
+      "learning_rate": 1.4553085783403201e-06,
+      "loss": 1.0192,
+      "step": 1340
+    },
+    {
+      "epoch": 1.3068731848983544,
+      "grad_norm": 1.8612922010001036,
+      "learning_rate": 1.419832753820496e-06,
+      "loss": 0.9862,
+      "step": 1350
+    },
+    {
+      "epoch": 1.3165537270087124,
+      "grad_norm": 1.822665782350888,
+      "learning_rate": 1.3846223815568005e-06,
+      "loss": 0.9557,
+      "step": 1360
+    },
+    {
+      "epoch": 1.3262342691190707,
+      "grad_norm": 2.0230674241134565,
+      "learning_rate": 1.349686114535875e-06,
+      "loss": 0.9833,
+      "step": 1370
+    },
+    {
+      "epoch": 1.3359148112294288,
+      "grad_norm": 1.860216169744428,
+      "learning_rate": 1.3150325383827117e-06,
+      "loss": 1.0331,
+      "step": 1380
+    },
+    {
+      "epoch": 1.345595353339787,
+      "grad_norm": 1.7708281013805311,
+      "learning_rate": 1.2806701692507162e-06,
+      "loss": 1.0069,
+      "step": 1390
+    },
+    {
+      "epoch": 1.3552758954501452,
+      "grad_norm": 2.2229157558033057,
+      "learning_rate": 1.2466074517288558e-06,
+      "loss": 0.9531,
+      "step": 1400
+    },
+    {
+      "epoch": 1.3552758954501452,
+      "eval_loss": 1.3317842483520508,
+      "eval_runtime": 11.0567,
+      "eval_samples_per_second": 60.416,
+      "eval_steps_per_second": 3.799,
+      "step": 1400
+    },
+    {
+      "epoch": 1.3649564375605034,
+      "grad_norm": 2.2172077437876148,
+      "learning_rate": 1.212852756766399e-06,
+      "loss": 0.991,
+      "step": 1410
+    },
+    {
+      "epoch": 1.3746369796708615,
+      "grad_norm": 1.9872462249614302,
+      "learning_rate": 1.1794143796157358e-06,
+      "loss": 0.957,
+      "step": 1420
+    },
+    {
+      "epoch": 1.3843175217812198,
+      "grad_norm": 1.9736206439733366,
+      "learning_rate": 1.1463005377938182e-06,
+      "loss": 0.9406,
+      "step": 1430
+    },
+    {
+      "epoch": 1.3939980638915779,
+      "grad_norm": 1.868279202351412,
+      "learning_rate": 1.1135193690626926e-06,
+      "loss": 0.958,
+      "step": 1440
+    },
+    {
+      "epoch": 1.4036786060019362,
+      "grad_norm": 1.773679136583482,
+      "learning_rate": 1.0810789294296397e-06,
+      "loss": 1.0262,
+      "step": 1450
+    },
+    {
+      "epoch": 1.4133591481122942,
+      "grad_norm": 1.9372925542431774,
+      "learning_rate": 1.048987191167398e-06,
+      "loss": 0.9745,
+      "step": 1460
+    },
+    {
+      "epoch": 1.4230396902226525,
+      "grad_norm": 1.9130176009818487,
+      "learning_rate": 1.0172520408549716e-06,
+      "loss": 0.9759,
+      "step": 1470
+    },
+    {
+      "epoch": 1.4327202323330106,
+      "grad_norm": 1.7950360757272799,
+      "learning_rate": 9.858812774394946e-07,
+      "loss": 1.0117,
+      "step": 1480
+    },
+    {
+      "epoch": 1.442400774443369,
+      "grad_norm": 2.049973774645242,
+      "learning_rate": 9.548826103196304e-07,
+      "loss": 0.9736,
+      "step": 1490
+    },
+    {
+      "epoch": 1.452081316553727,
+      "grad_norm": 2.1258539830795975,
+      "learning_rate": 9.242636574509828e-07,
+      "loss": 1.002,
+      "step": 1500
+    },
+    {
+      "epoch": 1.4617618586640853,
+      "grad_norm": 2.005339561713667,
+      "learning_rate": 8.940319434739683e-07,
+      "loss": 1.0391,
+      "step": 1510
+    },
+    {
+      "epoch": 1.4714424007744433,
+      "grad_norm": 2.239017432444768,
+      "learning_rate": 8.641948978646361e-07,
+      "loss": 0.9864,
+      "step": 1520
+    },
+    {
+      "epoch": 1.4811229428848016,
+      "grad_norm": 1.9665985929622667,
+      "learning_rate": 8.347598531088555e-07,
+      "loss": 1.0425,
+      "step": 1530
+    },
+    {
+      "epoch": 1.4908034849951597,
+      "grad_norm": 2.183127365642101,
+      "learning_rate": 8.05734042900363e-07,
+      "loss": 1.0028,
+      "step": 1540
+    },
+    {
+      "epoch": 1.500484027105518,
+      "grad_norm": 1.9211536428186309,
+      "learning_rate": 7.771246003630625e-07,
+      "loss": 0.9764,
+      "step": 1550
+    },
+    {
+      "epoch": 1.510164569215876,
+      "grad_norm": 2.1375711365241647,
+      "learning_rate": 7.489385562980589e-07,
+      "loss": 0.9658,
+      "step": 1560
+    },
+    {
+      "epoch": 1.5198451113262341,
+      "grad_norm": 1.8686163130309794,
+      "learning_rate": 7.211828374558311e-07,
+      "loss": 0.9621,
+      "step": 1570
+    },
+    {
+      "epoch": 1.5295256534365924,
+      "grad_norm": 1.880595967296347,
+      "learning_rate": 6.938642648339719e-07,
+      "loss": 0.9874,
+      "step": 1580
+    },
+    {
+      "epoch": 1.5392061955469507,
+      "grad_norm": 1.9750877274994336,
+      "learning_rate": 6.669895520009239e-07,
+      "loss": 0.9481,
+      "step": 1590
+    },
+    {
+      "epoch": 1.5488867376573088,
+      "grad_norm": 2.074716832274806,
+      "learning_rate": 6.405653034461115e-07,
+      "loss": 0.9555,
+      "step": 1600
+    },
+    {
+      "epoch": 1.5488867376573088,
+      "eval_loss": 1.3306459188461304,
+      "eval_runtime": 11.0266,
+      "eval_samples_per_second": 60.581,
+      "eval_steps_per_second": 3.809,
+      "step": 1600
+    },
+    {
+      "epoch": 1.5585672797676668,
+      "grad_norm": 1.9848611004154655,
+      "learning_rate": 6.145980129568823e-07,
+      "loss": 1.0002,
+      "step": 1610
+    },
+    {
+      "epoch": 1.5682478218780251,
+      "grad_norm": 2.0724938891948956,
+      "learning_rate": 5.890940620226479e-07,
+      "loss": 1.0028,
+      "step": 1620
+    },
+    {
+      "epoch": 1.5779283639883834,
+      "grad_norm": 2.123553045180447,
+      "learning_rate": 5.640597182666324e-07,
+      "loss": 0.9734,
+      "step": 1630
+    },
+    {
+      "epoch": 1.5876089060987415,
+      "grad_norm": 1.914830200132737,
+      "learning_rate": 5.395011339055886e-07,
+      "loss": 0.976,
+      "step": 1640
+    },
+    {
+      "epoch": 1.5972894482090996,
+      "grad_norm": 2.171534814633674,
+      "learning_rate": 5.154243442378934e-07,
+      "loss": 0.9662,
+      "step": 1650
+    },
+    {
+      "epoch": 1.6069699903194579,
+      "grad_norm": 2.201937231455955,
+      "learning_rate": 4.918352661603604e-07,
+      "loss": 1.0096,
+      "step": 1660
+    },
+    {
+      "epoch": 1.6166505324298162,
+      "grad_norm": 2.1624331142573316,
+      "learning_rate": 4.687396967141583e-07,
+      "loss": 0.9944,
+      "step": 1670
+    },
+    {
+      "epoch": 1.6263310745401742,
+      "grad_norm": 2.037622724361002,
+      "learning_rate": 4.4614331166018403e-07,
+      "loss": 0.9976,
+      "step": 1680
+    },
+    {
+      "epoch": 1.6360116166505323,
+      "grad_norm": 2.206747564852964,
+      "learning_rate": 4.2405166408423154e-07,
+      "loss": 0.9404,
+      "step": 1690
+    },
+    {
+      "epoch": 1.6456921587608906,
+      "grad_norm": 1.8730992912712554,
+      "learning_rate": 4.0247018303232437e-07,
+      "loss": 0.9871,
+      "step": 1700
+    },
+    {
+      "epoch": 1.6553727008712489,
+      "grad_norm": 1.7877219207538437,
+      "learning_rate": 3.8140417217651437e-07,
+      "loss": 0.9848,
+      "step": 1710
+    },
+    {
+      "epoch": 1.665053242981607,
+      "grad_norm": 1.9978225206623983,
+      "learning_rate": 3.608588085115028e-07,
+      "loss": 1.0306,
+      "step": 1720
+    },
+    {
+      "epoch": 1.674733785091965,
+      "grad_norm": 2.016944090640306,
+      "learning_rate": 3.408391410823864e-07,
+      "loss": 0.9356,
+      "step": 1730
+    },
+    {
+      "epoch": 1.6844143272023233,
+      "grad_norm": 1.8659688480019923,
+      "learning_rate": 3.213500897438487e-07,
+      "loss": 0.9694,
+      "step": 1740
+    },
+    {
+      "epoch": 1.6940948693126816,
+      "grad_norm": 1.9980865391289462,
+      "learning_rate": 3.023964439511026e-07,
+      "loss": 0.9816,
+      "step": 1750
+    },
+    {
+      "epoch": 1.7037754114230397,
+      "grad_norm": 2.009781583476138,
+      "learning_rate": 2.839828615828744e-07,
+      "loss": 1.0213,
+      "step": 1760
+    },
+    {
+      "epoch": 1.7134559535333977,
+      "grad_norm": 1.8630110644291102,
+      "learning_rate": 2.6611386779672786e-07,
+      "loss": 0.9737,
+      "step": 1770
+    },
+    {
+      "epoch": 1.723136495643756,
+      "grad_norm": 2.0561453406581616,
+      "learning_rate": 2.487938539169982e-07,
+      "loss": 0.9567,
+      "step": 1780
+    },
+    {
+      "epoch": 1.7328170377541143,
+      "grad_norm": 2.298935987903855,
+      "learning_rate": 2.3202707635562371e-07,
+      "loss": 0.9943,
+      "step": 1790
+    },
+    {
+      "epoch": 1.7424975798644724,
+      "grad_norm": 1.9952588611867719,
+      "learning_rate": 2.1581765556612233e-07,
+      "loss": 0.9866,
+      "step": 1800
+    },
+    {
+      "epoch": 1.7424975798644724,
+      "eval_loss": 1.3297312259674072,
+      "eval_runtime": 12.1526,
+      "eval_samples_per_second": 54.968,
+      "eval_steps_per_second": 3.456,
+      "step": 1800
+    },
+    {
+      "epoch": 1.7521781219748305,
+      "grad_norm": 2.134773106840517,
+      "learning_rate": 2.001695750309926e-07,
+      "loss": 0.9948,
+      "step": 1810
+    },
+    {
+      "epoch": 1.7618586640851888,
+      "grad_norm": 2.144436773492719,
+      "learning_rate": 1.8508668028276305e-07,
+      "loss": 0.9888,
+      "step": 1820
+    },
+    {
+      "epoch": 1.771539206195547,
+      "grad_norm": 1.9517157284176012,
+      "learning_rate": 1.7057267795895117e-07,
+      "loss": 1.0625,
+      "step": 1830
+    },
+    {
+      "epoch": 1.7812197483059051,
+      "grad_norm": 2.1230795918159022,
+      "learning_rate": 1.566311348911534e-07,
+      "loss": 0.9834,
+      "step": 1840
+    },
+    {
+      "epoch": 1.7909002904162632,
+      "grad_norm": 2.0986406154174726,
+      "learning_rate": 1.4326547722848972e-07,
+      "loss": 0.9507,
+      "step": 1850
+    },
+    {
+      "epoch": 1.8005808325266215,
+      "grad_norm": 1.8949427782604102,
+      "learning_rate": 1.3047898959562767e-07,
+      "loss": 0.9997,
+      "step": 1860
+    },
+    {
+      "epoch": 1.8102613746369798,
+      "grad_norm": 2.2118114869360226,
+      "learning_rate": 1.1827481428557969e-07,
+      "loss": 0.9919,
+      "step": 1870
+    },
+    {
+      "epoch": 1.8199419167473379,
+      "grad_norm": 2.332050729229143,
+      "learning_rate": 1.0665595048748257e-07,
+      "loss": 0.9774,
+      "step": 1880
+    },
+    {
+      "epoch": 1.829622458857696,
+      "grad_norm": 1.990555321029242,
+      "learning_rate": 9.562525354954194e-08,
+      "loss": 1.0053,
+      "step": 1890
+    },
+    {
+      "epoch": 1.8393030009680542,
+      "grad_norm": 2.21958378472287,
+      "learning_rate": 8.518543427732951e-08,
+      "loss": 0.9824,
+      "step": 1900
+    },
+    {
+      "epoch": 1.8489835430784125,
+      "grad_norm": 1.9806545361274472,
+      "learning_rate": 7.53390582675978e-08,
+      "loss": 0.9876,
+      "step": 1910
+    },
+    {
+      "epoch": 1.8586640851887706,
+      "grad_norm": 2.115087225131469,
+      "learning_rate": 6.608854527778319e-08,
+      "loss": 0.9587,
+      "step": 1920
+    },
+    {
+      "epoch": 1.8683446272991286,
+      "grad_norm": 2.167614031987702,
+      "learning_rate": 5.743616863134793e-08,
+      "loss": 0.9771,
+      "step": 1930
+    },
+    {
+      "epoch": 1.878025169409487,
+      "grad_norm": 1.9710104273100284,
+      "learning_rate": 4.938405465910706e-08,
+      "loss": 0.9591,
+      "step": 1940
+    },
+    {
+      "epoch": 1.8877057115198452,
+      "grad_norm": 1.9445231599191304,
+      "learning_rate": 4.193418217668305e-08,
+      "loss": 0.9684,
+      "step": 1950
+    },
+    {
+      "epoch": 1.8973862536302033,
+      "grad_norm": 1.9461388693279436,
+      "learning_rate": 3.508838199820591e-08,
+      "loss": 0.9637,
+      "step": 1960
+    },
+    {
+      "epoch": 1.9070667957405614,
+      "grad_norm": 1.9423016553139207,
+      "learning_rate": 2.884833648639257e-08,
+      "loss": 1.0192,
+      "step": 1970
+    },
+    {
+      "epoch": 1.9167473378509197,
+      "grad_norm": 2.0557831910870727,
+      "learning_rate": 2.3215579139101996e-08,
+      "loss": 0.9587,
+      "step": 1980
+    },
+    {
+      "epoch": 1.926427879961278,
+      "grad_norm": 2.102313560231879,
+      "learning_rate": 1.8191494212477513e-08,
+      "loss": 0.9727,
+      "step": 1990
+    },
+    {
+      "epoch": 1.936108422071636,
+      "grad_norm": 2.0696636891849263,
+      "learning_rate": 1.3777316380763073e-08,
+      "loss": 0.9891,
+      "step": 2000
+    },
+    {
+      "epoch": 1.936108422071636,
+      "eval_loss": 1.328829288482666,
+      "eval_runtime": 11.0059,
+      "eval_samples_per_second": 60.695,
+      "eval_steps_per_second": 3.816,
+      "step": 2000
+    },
+    {
+      "epoch": 1.945788964181994,
+      "grad_norm": 1.8247235042196286,
+      "learning_rate": 9.9741304328832e-09,
+      "loss": 0.9352,
+      "step": 2010
+    },
+    {
+      "epoch": 1.9554695062923524,
+      "grad_norm": 2.289088556921837,
+      "learning_rate": 6.782871005851788e-09,
+      "loss": 0.9591,
+      "step": 2020
+    },
+    {
+      "epoch": 1.9651500484027107,
+      "grad_norm": 2.030627946550713,
+      "learning_rate": 4.2043223550869425e-09,
+      "loss": 0.9201,
+      "step": 2030
+    },
+    {
+      "epoch": 1.9748305905130688,
+      "grad_norm": 1.935525119332238,
+      "learning_rate": 2.239118161677656e-09,
+      "loss": 0.9517,
+      "step": 2040
+    },
+    {
+      "epoch": 1.9845111326234268,
+      "grad_norm": 1.8394321921167789,
+      "learning_rate": 8.877413766561482e-10,
+      "loss": 0.9299,
+      "step": 2050
+    },
+    {
+      "epoch": 1.9941916747337851,
+      "grad_norm": 2.1282656054902067,
+      "learning_rate": 1.5052410231336522e-10,
+      "loss": 0.9732,
+      "step": 2060
+    },
+    {
+      "epoch": 2.0,
+      "step": 2066,
+      "total_flos": 93553682546688.0,
+      "train_loss": 0.9335812793004201,
+      "train_runtime": 7476.0796,
+      "train_samples_per_second": 17.685,
+      "train_steps_per_second": 0.276
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 2066,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 2,
+  "save_steps": 100,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": true
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 93553682546688.0,
+  "train_batch_size": 2,
+  "trial_name": null,
+  "trial_params": null
+}

training_args.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c5aed7aca8c570c5b079a351c19de48e2066206f25c68c55d51c258dcb784d83
+size 8209

training_eval_loss.png ADDED Viewed

training_loss.png ADDED Viewed