Dontbeafed69 committed on
Commit
4810488
·
verified ·
1 Parent(s): d4a6c49

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,198 +1,207 @@
1
  ---
2
  base_model: google/gemma-3-270m
3
  library_name: peft
4
- license: mit
5
- tags:
6
- - chess
7
- - lora
8
- - mixture-of-experts
9
- - mps
10
- - apple-silicon
11
- - gemma
12
- - uci
13
- - chess-engine
14
- datasets:
15
- - lukifer23/gemmafischer-chess-training
16
- language:
17
- - en
18
  pipeline_tag: text-generation
19
  ---
20
 
21
- # GemmaFischer UCI Expert LoRA
22
 
23
- LoRA adapter for chess move generation in UCI format, fine-tuned from Google's Gemma-3 270M base model. This is the **UCI Expert** from the GemmaFischer Mixture of Experts chess system, optimized for Apple Silicon with MPS acceleration.
24
 
25
- ## Model Description
26
 
27
- This adapter specializes in generating legal chess moves in UCI (Universal Chess Interface) format. It's part of a 3-expert system including:
28
- - **UCI Expert** (this model): Fast move generation in UCI format
29
- - **Tutor Expert**: Detailed chess explanations and analysis
30
- - **Director Expert**: Strategic reasoning and Q&A
31
 
32
- ## Training Details
 
33
 
34
- ### Base Model
35
- - **Model**: google/gemma-3-270m
36
- - **Architecture**: Gemma-3 270M parameters
37
 
38
- ### LoRA Configuration
39
- - **Rank (r)**: 16
40
- - **Alpha**: 32
41
- - **Dropout**: 0.05
42
- - **Target Modules**: `q_proj`, `k_proj`, `v_proj`, `o_proj`
43
- - **Task Type**: Causal Language Modeling
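These settings imply the usual LoRA scaling factor of alpha/r = 2. A minimal sketch of that arithmetic, with an illustrative per-module parameter count (the 640-wide projection is an assumed example dimension, not a value taken from this model):

```python
# LoRA bookkeeping for the configuration above: r=16, alpha=32.
r, lora_alpha = 16, 32

# Scaling applied to the low-rank update BA: alpha / r
scaling = lora_alpha / r  # 2.0

def lora_params(d_in, d_out, rank):
    """Trainable parameters one LoRA'd linear layer adds: A (d_in x rank) + B (rank x d_out)."""
    return rank * (d_in + d_out)

# Example: a hypothetical 640x640 attention projection (illustrative dimensions only)
per_module = lora_params(640, 640, r)
print(scaling, per_module)  # 2.0 20480
```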
44
 
45
  ### Training Data
46
- - **Dataset Size**: 50,000 chess positions
47
- - **Validation**: 100% Stockfish-verified legal moves
48
- - **Quality Score**: 0.8
49
- - **Format**: Standardized JSONL with metadata
50
-
51
- ### Training Metrics
52
- - **Total Steps**: 1,600
53
- - **Best Eval Loss**: 0.8723 (at step 1600)
54
- - **Final Training Loss**: 0.7017
55
- - **Training Platform**: Apple M3 Pro with MPS acceleration
56
- - **Training Speed**: ~2-3 steps/second
57
- - **Batch Size**: 1 with gradient accumulation
58
-
59
- ### Hardware & Optimization
60
- - **Platform**: Mac-only (M3 Pro)
61
- - **Acceleration**: MPS (Metal Performance Shaders)
62
- - **Memory Optimization**: Gradient checkpointing enabled
63
- - **Peak Memory**: ~3-5GB
64
-
65
- ## Usage
66
-
67
- ### Installation
68
- ```bash
69
- pip install transformers peft torch
70
- ```
71
-
72
- ### Loading the Model
73
- ```python
74
- from transformers import AutoModelForCausalLM, AutoTokenizer
75
- from peft import PeftModel
76
-
77
- # Load base model
78
- base_model = AutoModelForCausalLM.from_pretrained(
79
- "google/gemma-3-270m",
80
- device_map="mps", # For Apple Silicon
81
- torch_dtype="auto"
82
- )
83
-
84
- # Load LoRA adapter
85
- model = PeftModel.from_pretrained(
86
- base_model,
87
- "lukifer23/gemmafischer-uci-lora"
88
- )
89
-
90
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
91
- ```
92
-
93
- ### Generating UCI Moves
94
- ```python
95
- # Format: FEN position -> UCI move
96
- fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
97
- prompt = f"FEN: {fen}\nGenerate the best move in UCI format only:"
98
-
99
- inputs = tokenizer(prompt, return_tensors="pt").to("mps")
100
- outputs = model.generate(
101
-     **inputs,
102
-     max_new_tokens=5,
103
-     do_sample=False  # greedy decoding gives deterministic UCI output
104
- )
105
-
106
- # Decode only the newly generated tokens, not the prompt
107
- move = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
108
- print(move)  # e.g., "e2e4"
109
- ```
110
-
111
- ### Integration with Chess Software
112
- ```python
113
- import chess
114
- import re
115
-
116
- def get_uci_move(fen_position):
117
-     """Generate a UCI move for a given FEN position."""
118
-     prompt = f"FEN: {fen_position}\nGenerate the best move in UCI format only:"
119
-     inputs = tokenizer(prompt, return_tensors="pt").to("mps")
120
-     outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
121
-     # Decode only the new tokens so the FEN in the prompt cannot match the move regex
122
-     move_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
123
-
124
-     uci_match = re.search(r'[a-h][1-8][a-h][1-8][qrbn]?', move_text)  # e2e4, or e7e8q for promotion
125
-     return uci_match.group(0) if uci_match else None
126
-
127
- # Example usage
128
- board = chess.Board()
129
- uci_move = get_uci_move(board.fen())
130
- if uci_move:
131
- move = chess.Move.from_uci(uci_move)
132
- board.push(move)
133
- ```
134
-
135
- ## Performance
136
-
137
- ### Capabilities
138
- - **Move Legality**: 100% legal move generation (Stockfish validated)
139
- - **UCI Format**: Correct UCI notation (e.g., `e2e4`, `e7e8q`)
140
- - **Inference Speed**: ~0.4-0.5s per move on M3 Pro
141
- - **Special Moves**: Supports castling, en passant, promotions
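The move-extraction step described above can be exercised with nothing but the standard library; a small self-contained sketch (`extract_uci` is an illustrative helper mirroring the regex from the usage example, not part of the released code):

```python
import re

# Coordinate moves like e2e4, with an optional promotion piece (e7e8q)
UCI_RE = re.compile(r'\b([a-h][1-8][a-h][1-8][qrbn]?)\b')

def extract_uci(generated_text):
    """Return the first UCI-shaped token in the model's output, or None."""
    match = UCI_RE.search(generated_text)
    return match.group(1) if match else None

print(extract_uci("best move: e2e4"))  # e2e4
print(extract_uci("e7e8q wins"))       # e7e8q
print(extract_uci("no move here"))     # None
```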
142
-
143
- ### Limitations
144
- - Optimized for Apple Silicon MPS only
145
- - Not a strong chess engine (270M parameters)
146
- - Best used as part of MoE system with other experts
147
- - Requires base model access (Google Gemma-3)
148
-
149
- ## System Requirements
150
-
151
- - **Hardware**: Mac with Apple Silicon (M1/M2/M3/M4)
152
- - **RAM**: 8GB minimum, 16GB recommended
153
- - **macOS**: 12.0+ (for MPS support)
154
- - **Python**: 3.10+
155
-
156
- ## Related Models & Resources
157
-
158
- ### GemmaFischer Collection
159
- - **Tutor Expert**: [lukifer23/gemmafischer-tutor-lora](https://huggingface.co/lukifer23/gemmafischer-tutor-lora) (coming soon)
160
- - **Director Expert**: [lukifer23/gemmafischer-director-lora](https://huggingface.co/lukifer23/gemmafischer-director-lora) (coming soon)
161
- - **Training Dataset**: [lukifer23/gemmafischer-chess-training](https://huggingface.co/datasets/lukifer23/gemmafischer-chess-training) (coming soon)
162
-
163
- ### Repository
164
- - **GitHub**: [github.com/lukifer23/GemmaFischer](https://github.com/lukifer23/GemmaFischer)
165
- - **Documentation**: Full training guides, evaluation tools, and MoE system
166
- - **Web Interface**: Interactive chess board with expert switching
167
-
168
- ## Training Loss Curve
169
-
170
- The model was trained for 1,600 steps with evaluation every 100 steps:
171
- - Initial loss: 4.59 (step 1)
172
- - Best eval loss: 0.872 (step 1600)
173
- - Final training loss: 0.702 (step 1600)
174
-
175
- The loss converged steadily under a cosine learning-rate schedule decaying from a peak of 1e-4 to near zero.
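Read off the log history, the learning rate climbs linearly to its ~1e-4 peak by roughly step 160 and then follows the cosine decay; a minimal sketch of that schedule (the warmup length is an assumption inferred from the logged values):

```python
import math

PEAK_LR = 1e-4
TOTAL_STEPS = 1600
WARMUP_STEPS = 160  # assumed: the logged LR reaches ~1e-4 around step 160

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(160), lr_at(1600))
```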
176
-
177
- ## Citation
178
-
179
- ```bibtex
180
- @misc{gemmafischer2025,
181
- author = {lukifer23},
182
- title = {GemmaFischer: Chess Engine and Tutor with Mixture of Experts},
183
- year = {2025},
184
- publisher = {HuggingFace},
185
- howpublished = {\url{https://huggingface.co/lukifer23/gemmafischer-uci-lora}}
186
- }
187
- ```
188
-
189
- ## License
190
-
191
- MIT License - See [LICENSE](https://github.com/lukifer23/GemmaFischer/blob/main/LICENSE) file for details.
192
-
193
- ## Acknowledgments
194
-
195
- - **Base Model**: Google's Gemma-3 270M
196
- - **Training Platform**: Apple Silicon (M3 Pro) with MPS
197
- - **Validation**: Stockfish chess engine
198
- - **Framework**: HuggingFace Transformers + PEFT
 
1
  ---
2
  base_model: google/gemma-3-270m
3
  library_name: peft
 
4
  pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:google/gemma-3-270m
7
+ - lora
8
+ - transformers
9
  ---
10
 
11
+ # Model Card for Model ID
12
 
13
+ <!-- Provide a quick summary of what the model is/does. -->
14
 
 
15
 
16
 
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+
21
+ <!-- Provide a longer summary of what this model is. -->
22
+
23
+
24
+
25
+ - **Developed by:** [More Information Needed]
26
+ - **Funded by [optional]:** [More Information Needed]
27
+ - **Shared by [optional]:** [More Information Needed]
28
+ - **Model type:** [More Information Needed]
29
+ - **Language(s) (NLP):** [More Information Needed]
30
+ - **License:** [More Information Needed]
31
+ - **Finetuned from model [optional]:** [More Information Needed]
32
+
33
+ ### Model Sources [optional]
34
+
35
+ <!-- Provide the basic links for the model. -->
36
+
37
+ - **Repository:** [More Information Needed]
38
+ - **Paper [optional]:** [More Information Needed]
39
+ - **Demo [optional]:** [More Information Needed]
40
+
41
+ ## Uses
42
+
43
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
+
45
+ ### Direct Use
46
+
47
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
+
49
+ [More Information Needed]
50
+
51
+ ### Downstream Use [optional]
52
+
53
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
+
55
+ [More Information Needed]
56
+
57
+ ### Out-of-Scope Use
58
+
59
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
+
61
+ [More Information Needed]
62
+
63
+ ## Bias, Risks, and Limitations
64
 
65
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 
66
 
67
+ [More Information Needed]
68
+
69
+ ### Recommendations
70
+
71
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
+
73
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
+
75
+ ## How to Get Started with the Model
76
+
77
+ Use the code below to get started with the model.
78
+
79
+ [More Information Needed]
80
+
81
+ ## Training Details
82
 
83
  ### Training Data
84
+
85
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
+
87
+ [More Information Needed]
88
+
89
+ ### Training Procedure
90
+
91
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
+
93
+ #### Preprocessing [optional]
94
+
95
+ [More Information Needed]
96
+
97
+
98
+ #### Training Hyperparameters
99
+
100
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
+
102
+ #### Speeds, Sizes, Times [optional]
103
+
104
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
+
106
+ [More Information Needed]
107
+
108
+ ## Evaluation
109
+
110
+ <!-- This section describes the evaluation protocols and provides the results. -->
111
+
112
+ ### Testing Data, Factors & Metrics
113
+
114
+ #### Testing Data
115
+
116
+ <!-- This should link to a Dataset Card if possible. -->
117
+
118
+ [More Information Needed]
119
+
120
+ #### Factors
121
+
122
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
+
124
+ [More Information Needed]
125
+
126
+ #### Metrics
127
+
128
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
+
130
+ [More Information Needed]
131
+
132
+ ### Results
133
+
134
+ [More Information Needed]
135
+
136
+ #### Summary
137
+
138
+
139
+
140
+ ## Model Examination [optional]
141
+
142
+ <!-- Relevant interpretability work for the model goes here -->
143
+
144
+ [More Information Needed]
145
+
146
+ ## Environmental Impact
147
+
148
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
+
150
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
+
152
+ - **Hardware Type:** [More Information Needed]
153
+ - **Hours used:** [More Information Needed]
154
+ - **Cloud Provider:** [More Information Needed]
155
+ - **Compute Region:** [More Information Needed]
156
+ - **Carbon Emitted:** [More Information Needed]
157
+
158
+ ## Technical Specifications [optional]
159
+
160
+ ### Model Architecture and Objective
161
+
162
+ [More Information Needed]
163
+
164
+ ### Compute Infrastructure
165
+
166
+ [More Information Needed]
167
+
168
+ #### Hardware
169
+
170
+ [More Information Needed]
171
+
172
+ #### Software
173
+
174
+ [More Information Needed]
175
+
176
+ ## Citation [optional]
177
+
178
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
+
180
+ **BibTeX:**
181
+
182
+ [More Information Needed]
183
+
184
+ **APA:**
185
+
186
+ [More Information Needed]
187
+
188
+ ## Glossary [optional]
189
+
190
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
+
192
+ [More Information Needed]
193
+
194
+ ## More Information [optional]
195
+
196
+ [More Information Needed]
197
+
198
+ ## Model Card Authors [optional]
199
+
200
+ [More Information Needed]
201
+
202
+ ## Model Card Contact
203
+
204
+ [More Information Needed]
205
+ ### Framework versions
206
+
207
+ - PEFT 0.17.1
 
adapter_config.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "alpha_pattern": {},
3
  "auto_mapping": null,
4
- "base_model_name_or_path": "/Users/admin/Downloads/VSCode/GemmaFischer/models/google-gemma-3-270m",
5
  "bias": "none",
6
  "corda_config": null,
7
  "eva_config": null,
@@ -13,7 +13,7 @@
13
  "layers_pattern": null,
14
  "layers_to_transform": null,
15
  "loftq_config": {},
16
- "lora_alpha": 32,
17
  "lora_bias": false,
18
  "lora_dropout": 0.05,
19
  "megatron_config": null,
@@ -21,14 +21,17 @@
21
  "modules_to_save": null,
22
  "peft_type": "LORA",
23
  "qalora_group_size": 16,
24
- "r": 16,
25
  "rank_pattern": {},
26
  "revision": null,
27
  "target_modules": [
28
  "k_proj",
29
  "o_proj",
30
- "q_proj",
31
- "v_proj"
32
  ],
33
  "target_parameters": null,
34
  "task_type": "CAUSAL_LM",
 
1
  {
2
  "alpha_pattern": {},
3
  "auto_mapping": null,
4
+ "base_model_name_or_path": "google/gemma-3-270m",
5
  "bias": "none",
6
  "corda_config": null,
7
  "eva_config": null,
 
13
  "layers_pattern": null,
14
  "layers_to_transform": null,
15
  "loftq_config": {},
16
+ "lora_alpha": 64,
17
  "lora_bias": false,
18
  "lora_dropout": 0.05,
19
  "megatron_config": null,
 
21
  "modules_to_save": null,
22
  "peft_type": "LORA",
23
  "qalora_group_size": 16,
24
+ "r": 32,
25
  "rank_pattern": {},
26
  "revision": null,
27
  "target_modules": [
28
+ "gate_proj",
29
+ "down_proj",
30
+ "up_proj",
31
  "k_proj",
32
  "o_proj",
33
+ "v_proj",
34
+ "q_proj"
35
  ],
36
  "target_parameters": null,
37
  "task_type": "CAUSAL_LM",
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:74d6d9edde51340678ce1ee14ae112b077fc53078404605c4d903803c3f67bbf
3
- size 5917192
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:938b914519292237dea78823bff38d42b726382b54a5f3cd464add97d8d2bd25
3
+ size 30409120
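The adapter file grew from 5,917,192 to 30,409,120 bytes, consistent with the config change above (rank doubled to 32 and LoRA extended from four attention projections to seven modules by adding the MLP's gate/up/down projections); a rough sanity check, assuming 2-byte (bf16/fp16) weights and ignoring the small safetensors header:

```python
# File sizes from the adapter_model.safetensors diff (bytes)
OLD_BYTES, NEW_BYTES = 5_917_192, 30_409_120

# Approximate parameter counts, assuming 2 bytes per weight (header ignored)
old_params = OLD_BYTES // 2
new_params = NEW_BYTES // 2
print(old_params, new_params, round(new_params / old_params, 2))
```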
optimizer.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4ee04aa80d94558c7cfc322b976fbbfaa7d7d2991535e15290a0c01d094b5cf2
3
- size 2156890246
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cd5756df66c54278fe70b29b27a913fa01a64a7254a5221338f09897e0ce9588
3
+ size 2205934555
rng_state.pth CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a416021fcc136006bfe4651385bb006441ee5a161cc6b38aff634835fe44cadc
3
- size 13990
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66f70f09d3cb910592b4d2344caae1568a9b5043377429849d732244fbeec9cb
3
+ size 14391
scheduler.pt CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:933821bed50a92dd2dc11b2ebd21a8303e761867bc574b4556159143c11330c7
3
- size 1064
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d49cd60c3246e9d86bd41348cd643f8e1cabffcb1dda38d57f3cf7a26c4f60d4
3
+ size 1465
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ca2f60fd56eabb86ada6d0ef7c30d1ce71e1ed22af2d19e5238a9f0a5cdfa23c
3
- size 33384666
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4201e7b539fef153e1fe3058db39e600717b3323fee690d37e92fa52fb2b5af2
3
+ size 33384667
trainer_state.json CHANGED
@@ -1,2395 +1,1179 @@
1
  {
2
- "best_global_step": 1600,
3
- "best_metric": 0.8723308444023132,
4
- "best_model_checkpoint": "checkpoints/lora_uci/checkpoint-1600",
5
- "epoch": 0.14222222222222222,
6
- "eval_steps": 100,
7
- "global_step": 1600,
8
  "is_hyper_param_search": false,
9
  "is_local_process_zero": true,
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
- "epoch": 8.888888888888889e-05,
14
- "grad_norm": 34.336647033691406,
15
  "learning_rate": 0.0,
16
- "loss": 4.5963,
17
  "step": 1
18
  },
19
  {
20
- "epoch": 0.00044444444444444447,
21
- "grad_norm": 32.09209060668945,
22
- "learning_rate": 2.5e-06,
23
- "loss": 4.4684,
24
  "step": 5
25
  },
26
  {
27
- "epoch": 0.0008888888888888889,
28
- "grad_norm": 31.128032684326172,
29
- "learning_rate": 5.625e-06,
30
- "loss": 3.7103,
31
  "step": 10
32
  },
33
  {
34
- "epoch": 0.0013333333333333333,
35
- "grad_norm": 22.065799713134766,
36
- "learning_rate": 8.75e-06,
37
- "loss": 2.5875,
38
  "step": 15
39
  },
40
  {
41
- "epoch": 0.0017777777777777779,
42
- "grad_norm": 14.675823211669922,
43
- "learning_rate": 1.1875e-05,
44
- "loss": 1.7574,
45
  "step": 20
46
  },
47
  {
48
- "epoch": 0.0022222222222222222,
49
- "grad_norm": 15.899088859558105,
50
- "learning_rate": 1.5e-05,
51
- "loss": 1.426,
52
  "step": 25
53
  },
54
  {
55
- "epoch": 0.0026666666666666666,
56
- "grad_norm": 14.202256202697754,
57
- "learning_rate": 1.8125e-05,
58
- "loss": 1.2788,
59
  "step": 30
60
  },
61
  {
62
- "epoch": 0.003111111111111111,
63
- "grad_norm": 9.793079376220703,
64
- "learning_rate": 2.125e-05,
65
- "loss": 1.2216,
66
  "step": 35
67
  },
68
  {
69
- "epoch": 0.0035555555555555557,
70
- "grad_norm": 19.244722366333008,
71
- "learning_rate": 2.4375e-05,
72
- "loss": 1.2066,
73
  "step": 40
74
  },
75
  {
76
- "epoch": 0.004,
77
- "grad_norm": 8.825323104858398,
78
- "learning_rate": 2.7500000000000004e-05,
79
- "loss": 1.0881,
80
  "step": 45
81
  },
82
  {
83
- "epoch": 0.0044444444444444444,
84
- "grad_norm": 10.595223426818848,
85
- "learning_rate": 3.0625000000000006e-05,
86
- "loss": 1.0866,
87
  "step": 50
88
  },
89
  {
90
- "epoch": 0.004888888888888889,
91
- "grad_norm": 12.221918106079102,
92
- "learning_rate": 3.375000000000001e-05,
93
- "loss": 1.1606,
94
  "step": 55
95
  },
96
  {
97
- "epoch": 0.005333333333333333,
98
- "grad_norm": 11.6161527633667,
99
- "learning_rate": 3.6875e-05,
100
- "loss": 1.1047,
101
  "step": 60
102
  },
103
  {
104
- "epoch": 0.0057777777777777775,
105
- "grad_norm": 8.067273139953613,
106
- "learning_rate": 4e-05,
107
- "loss": 1.0627,
108
  "step": 65
109
  },
110
  {
111
- "epoch": 0.006222222222222222,
112
- "grad_norm": 12.541388511657715,
113
- "learning_rate": 4.3125000000000005e-05,
114
- "loss": 1.0847,
115
  "step": 70
116
  },
117
  {
118
- "epoch": 0.006666666666666667,
119
- "grad_norm": 11.718969345092773,
120
- "learning_rate": 4.6250000000000006e-05,
121
- "loss": 1.0382,
122
  "step": 75
123
  },
124
  {
125
- "epoch": 0.0071111111111111115,
126
- "grad_norm": 9.308419227600098,
127
- "learning_rate": 4.937500000000001e-05,
128
- "loss": 1.0483,
129
  "step": 80
130
  },
131
  {
132
- "epoch": 0.007555555555555556,
133
- "grad_norm": 6.772762298583984,
134
- "learning_rate": 5.25e-05,
135
- "loss": 1.009,
136
  "step": 85
137
  },
138
  {
139
- "epoch": 0.008,
140
- "grad_norm": 9.496241569519043,
141
- "learning_rate": 5.5625000000000004e-05,
142
- "loss": 0.9837,
143
  "step": 90
144
  },
145
  {
146
- "epoch": 0.008444444444444444,
147
- "grad_norm": 7.885592937469482,
148
- "learning_rate": 5.8750000000000005e-05,
149
- "loss": 1.0189,
150
  "step": 95
151
  },
152
  {
153
- "epoch": 0.008888888888888889,
154
- "grad_norm": 5.724958419799805,
155
- "learning_rate": 6.1875e-05,
156
- "loss": 1.0009,
157
- "step": 100
158
- },
159
- {
160
- "epoch": 0.008888888888888889,
161
- "eval_loss": 1.1365891695022583,
162
- "eval_runtime": 185.2718,
163
- "eval_samples_per_second": 26.987,
164
- "eval_steps_per_second": 3.373,
165
  "step": 100
166
  },
167
  {
168
- "epoch": 0.009333333333333334,
169
- "grad_norm": 7.011026859283447,
170
- "learning_rate": 6.500000000000001e-05,
171
- "loss": 1.0031,
172
  "step": 105
173
  },
174
  {
175
- "epoch": 0.009777777777777778,
176
- "grad_norm": 7.641518592834473,
177
- "learning_rate": 6.8125e-05,
178
- "loss": 0.997,
179
  "step": 110
180
  },
181
  {
182
- "epoch": 0.010222222222222223,
183
- "grad_norm": 9.401971817016602,
184
- "learning_rate": 7.125000000000001e-05,
185
- "loss": 1.0158,
186
  "step": 115
187
  },
188
  {
189
- "epoch": 0.010666666666666666,
190
- "grad_norm": 4.336047649383545,
191
- "learning_rate": 7.4375e-05,
192
- "loss": 0.9726,
193
  "step": 120
194
  },
195
  {
196
- "epoch": 0.011111111111111112,
197
- "grad_norm": 6.882427215576172,
198
- "learning_rate": 7.75e-05,
199
- "loss": 1.0175,
200
  "step": 125
201
  },
202
  {
203
- "epoch": 0.011555555555555555,
204
- "grad_norm": 5.442468643188477,
205
- "learning_rate": 8.062500000000001e-05,
206
- "loss": 0.936,
207
  "step": 130
208
  },
209
  {
210
- "epoch": 0.012,
211
- "grad_norm": 4.264267444610596,
212
- "learning_rate": 8.375e-05,
213
- "loss": 0.9527,
214
  "step": 135
215
  },
216
  {
217
- "epoch": 0.012444444444444444,
218
- "grad_norm": 5.994289398193359,
219
- "learning_rate": 8.687500000000001e-05,
220
- "loss": 0.9757,
221
  "step": 140
222
  },
223
  {
224
- "epoch": 0.012888888888888889,
225
- "grad_norm": 5.154539585113525,
226
- "learning_rate": 9e-05,
227
- "loss": 0.9634,
228
  "step": 145
229
  },
230
  {
231
- "epoch": 0.013333333333333334,
232
- "grad_norm": 5.39900541305542,
233
- "learning_rate": 9.3125e-05,
234
- "loss": 0.9563,
235
  "step": 150
236
  },
237
  {
238
- "epoch": 0.013777777777777778,
239
- "grad_norm": 5.613903522491455,
240
- "learning_rate": 9.625000000000001e-05,
241
- "loss": 0.9728,
242
  "step": 155
243
  },
244
  {
245
- "epoch": 0.014222222222222223,
246
- "grad_norm": 4.219268798828125,
247
- "learning_rate": 9.9375e-05,
248
- "loss": 0.9342,
249
  "step": 160
250
  },
251
  {
252
- "epoch": 0.014666666666666666,
253
- "grad_norm": 6.655375003814697,
254
- "learning_rate": 9.999809615320856e-05,
255
- "loss": 0.9196,
256
  "step": 165
257
  },
258
  {
259
- "epoch": 0.015111111111111112,
260
- "grad_norm": 4.512648582458496,
261
- "learning_rate": 9.999036202410325e-05,
262
- "loss": 1.0073,
263
  "step": 170
264
  },
265
  {
266
- "epoch": 0.015555555555555555,
267
- "grad_norm": 4.687705993652344,
268
- "learning_rate": 9.997667954183565e-05,
269
- "loss": 0.9762,
270
  "step": 175
271
  },
272
  {
273
- "epoch": 0.016,
274
- "grad_norm": 4.793412685394287,
275
- "learning_rate": 9.995705033448435e-05,
276
- "loss": 0.9198,
277
  "step": 180
278
  },
279
  {
280
- "epoch": 0.016444444444444446,
281
- "grad_norm": 5.910813808441162,
282
- "learning_rate": 9.99314767377287e-05,
283
- "loss": 0.9407,
284
  "step": 185
285
  },
286
  {
287
- "epoch": 0.016888888888888887,
288
- "grad_norm": 5.998133659362793,
289
- "learning_rate": 9.9899961794571e-05,
290
- "loss": 0.9476,
291
  "step": 190
292
  },
293
  {
294
- "epoch": 0.017333333333333333,
295
- "grad_norm": 4.035412311553955,
296
- "learning_rate": 9.986250925497429e-05,
297
- "loss": 0.9648,
298
  "step": 195
299
  },
300
  {
301
- "epoch": 0.017777777777777778,
302
- "grad_norm": 4.533802032470703,
303
- "learning_rate": 9.981912357541627e-05,
304
- "loss": 0.9095,
305
  "step": 200
306
  },
307
  {
308
- "epoch": 0.017777777777777778,
309
- "eval_loss": 1.080334186553955,
310
- "eval_runtime": 187.8444,
311
- "eval_samples_per_second": 26.618,
312
- "eval_steps_per_second": 3.327,
313
  "step": 200
314
  },
315
  {
316
- "epoch": 0.018222222222222223,
317
- "grad_norm": 4.582113265991211,
318
- "learning_rate": 9.976980991835894e-05,
319
- "loss": 0.9287,
320
  "step": 205
321
  },
322
  {
323
- "epoch": 0.018666666666666668,
324
- "grad_norm": 3.81330943107605,
325
- "learning_rate": 9.971457415163435e-05,
326
- "loss": 0.9538,
327
  "step": 210
328
  },
329
  {
330
- "epoch": 0.01911111111111111,
331
- "grad_norm": 4.3214430809021,
332
- "learning_rate": 9.965342284774632e-05,
333
- "loss": 0.9262,
334
  "step": 215
335
  },
336
  {
337
- "epoch": 0.019555555555555555,
338
- "grad_norm": 4.554825305938721,
339
- "learning_rate": 9.958636328308853e-05,
340
- "loss": 0.9172,
341
  "step": 220
342
  },
343
  {
344
- "epoch": 0.02,
345
- "grad_norm": 4.938535213470459,
346
- "learning_rate": 9.951340343707852e-05,
347
- "loss": 0.916,
348
  "step": 225
349
  },
350
  {
351
- "epoch": 0.020444444444444446,
352
- "grad_norm": 4.2901930809021,
353
- "learning_rate": 9.943455199120837e-05,
354
- "loss": 0.9322,
355
  "step": 230
356
  },
357
  {
358
- "epoch": 0.020888888888888887,
359
- "grad_norm": 4.4027228355407715,
360
- "learning_rate": 9.93498183280116e-05,
361
- "loss": 0.8971,
362
  "step": 235
363
  },
364
  {
365
- "epoch": 0.021333333333333333,
366
- "grad_norm": 4.342219829559326,
367
- "learning_rate": 9.925921252994676e-05,
368
- "loss": 0.9283,
369
  "step": 240
370
  },
371
  {
372
- "epoch": 0.021777777777777778,
373
- "grad_norm": 4.265533447265625,
374
- "learning_rate": 9.916274537819775e-05,
375
- "loss": 0.9491,
376
  "step": 245
377
  },
378
  {
379
- "epoch": 0.022222222222222223,
380
- "grad_norm": 3.2655222415924072,
381
- "learning_rate": 9.906042835139089e-05,
382
- "loss": 0.9067,
383
  "step": 250
384
  },
385
  {
386
- "epoch": 0.02266666666666667,
387
- "grad_norm": 5.214357852935791,
388
- "learning_rate": 9.89522736242292e-05,
389
- "loss": 0.9483,
390
  "step": 255
391
  },
392
  {
393
- "epoch": 0.02311111111111111,
394
- "grad_norm": 3.831639289855957,
395
- "learning_rate": 9.883829406604363e-05,
396
- "loss": 0.8565,
397
  "step": 260
398
  },
399
  {
400
- "epoch": 0.023555555555555555,
401
- "grad_norm": 3.9132847785949707,
402
- "learning_rate": 9.871850323926177e-05,
403
- "loss": 0.9348,
404
  "step": 265
405
  },
406
  {
407
- "epoch": 0.024,
408
- "grad_norm": 3.378690481185913,
409
- "learning_rate": 9.859291539779406e-05,
410
- "loss": 0.9033,
411
  "step": 270
412
  },
413
  {
414
- "epoch": 0.024444444444444446,
415
- "grad_norm": 4.511641025543213,
416
- "learning_rate": 9.846154548533773e-05,
417
- "loss": 0.908,
418
  "step": 275
419
  },
420
  {
421
- "epoch": 0.024888888888888887,
422
- "grad_norm": 2.6710896492004395,
423
- "learning_rate": 9.832440913359861e-05,
424
- "loss": 0.8716,
425
  "step": 280
426
  },
427
  {
428
- "epoch": 0.025333333333333333,
429
- "grad_norm": 3.3226819038391113,
430
- "learning_rate": 9.818152266043114e-05,
431
- "loss": 0.9229,
432
  "step": 285
433
  },
434
  {
435
- "epoch": 0.025777777777777778,
436
- "grad_norm": 2.8580386638641357,
437
- "learning_rate": 9.803290306789676e-05,
438
- "loss": 0.8817,
439
  "step": 290
440
  },
441
  {
442
- "epoch": 0.026222222222222223,
443
- "grad_norm": 3.8689441680908203,
444
- "learning_rate": 9.787856804024073e-05,
445
- "loss": 0.8654,
446
  "step": 295
447
  },
448
  {
449
- "epoch": 0.02666666666666667,
450
- "grad_norm": 5.108214855194092,
451
- "learning_rate": 9.771853594178791e-05,
452
- "loss": 0.9015,
453
- "step": 300
454
- },
455
- {
456
- "epoch": 0.02666666666666667,
457
- "eval_loss": 1.0304287672042847,
458
- "eval_runtime": 149.0721,
459
- "eval_samples_per_second": 33.541,
460
- "eval_steps_per_second": 4.193,
461
  "step": 300
462
  },
463
  {
464
- "epoch": 0.02711111111111111,
465
- "grad_norm": 3.7313003540039062,
466
- "learning_rate": 9.755282581475769e-05,
467
- "loss": 0.8804,
468
  "step": 305
469
  },
470
  {
471
- "epoch": 0.027555555555555555,
472
- "grad_norm": 2.914902687072754,
473
- "learning_rate": 9.738145737699799e-05,
474
- "loss": 0.8449,
475
  "step": 310
476
  },
477
  {
478
- "epoch": 0.028,
479
- "grad_norm": 3.4815356731414795,
480
- "learning_rate": 9.720445101963922e-05,
481
- "loss": 0.8962,
482
  "step": 315
483
  },
484
  {
485
- "epoch": 0.028444444444444446,
486
- "grad_norm": 3.9641096591949463,
487
- "learning_rate": 9.702182780466775e-05,
488
- "loss": 0.8539,
489
  "step": 320
490
  },
491
  {
492
- "epoch": 0.028888888888888888,
493
- "grad_norm": 3.4962000846862793,
494
- "learning_rate": 9.683360946241989e-05,
495
- "loss": 0.8779,
496
  "step": 325
497
  },
498
  {
499
- "epoch": 0.029333333333333333,
500
- "grad_norm": 3.8663835525512695,
501
- "learning_rate": 9.663981838899612e-05,
502
- "loss": 0.8943,
503
  "step": 330
504
  },
505
  {
506
- "epoch": 0.029777777777777778,
507
- "grad_norm": 3.476285934448242,
508
- "learning_rate": 9.644047764359622e-05,
509
- "loss": 0.8697,
510
  "step": 335
511
  },
512
  {
513
- "epoch": 0.030222222222222223,
514
- "grad_norm": 3.642954111099243,
515
- "learning_rate": 9.623561094577542e-05,
516
- "loss": 0.9362,
517
  "step": 340
518
  },
519
  {
520
- "epoch": 0.030666666666666665,
521
- "grad_norm": 3.1184070110321045,
522
- "learning_rate": 9.602524267262203e-05,
523
- "loss": 0.8999,
524
  "step": 345
525
  },
526
  {
527
- "epoch": 0.03111111111111111,
528
- "grad_norm": 3.0415987968444824,
529
- "learning_rate": 9.580939785585681e-05,
530
- "loss": 0.8613,
531
  "step": 350
532
  },
533
  {
534
- "epoch": 0.03155555555555556,
535
- "grad_norm": 2.9512388706207275,
536
- "learning_rate": 9.558810217885443e-05,
537
- "loss": 0.8499,
538
  "step": 355
539
  },
540
  {
541
- "epoch": 0.032,
542
- "grad_norm": 3.8605785369873047,
543
- "learning_rate": 9.536138197358745e-05,
544
- "loss": 0.8472,
545
  "step": 360
546
  },
547
  {
548
- "epoch": 0.03244444444444444,
549
- "grad_norm": 2.254096031188965,
550
- "learning_rate": 9.512926421749304e-05,
551
- "loss": 0.8302,
552
  "step": 365
553
  },
554
  {
555
- "epoch": 0.03288888888888889,
556
- "grad_norm": 2.6906585693359375,
557
- "learning_rate": 9.489177653026289e-05,
558
- "loss": 0.8939,
559
  "step": 370
560
  },
561
  {
562
- "epoch": 0.03333333333333333,
563
- "grad_norm": 2.434436559677124,
564
- "learning_rate": 9.464894717055686e-05,
565
- "loss": 0.9043,
566
  "step": 375
567
  },
568
  {
569
- "epoch": 0.033777777777777775,
570
- "grad_norm": 2.497917652130127,
571
- "learning_rate": 9.440080503264037e-05,
572
- "loss": 0.8737,
573
  "step": 380
574
  },
575
  {
576
- "epoch": 0.03422222222222222,
577
- "grad_norm": 4.014335632324219,
578
- "learning_rate": 9.414737964294636e-05,
579
- "loss": 0.8566,
580
  "step": 385
581
  },
582
  {
583
- "epoch": 0.034666666666666665,
584
- "grad_norm": 2.4953484535217285,
585
- "learning_rate": 9.388870115656184e-05,
586
- "loss": 0.8479,
587
  "step": 390
588
  },
589
  {
590
- "epoch": 0.035111111111111114,
591
- "grad_norm": 3.679469347000122,
592
- "learning_rate": 9.362480035363986e-05,
593
- "loss": 0.8702,
594
  "step": 395
595
  },
596
  {
597
- "epoch": 0.035555555555555556,
598
- "grad_norm": 3.756469249725342,
599
- "learning_rate": 9.335570863573686e-05,
600
- "loss": 0.8408,
601
  "step": 400
602
  },
603
  {
604
- "epoch": 0.035555555555555556,
605
- "eval_loss": 0.9743552803993225,
606
- "eval_runtime": 145.3606,
607
- "eval_samples_per_second": 34.397,
608
- "eval_steps_per_second": 4.3,
609
  "step": 400
610
  },
611
  {
612
- "epoch": 0.036,
613
- "grad_norm": 2.5384509563446045,
614
- "learning_rate": 9.308145802207629e-05,
615
- "loss": 0.7945,
616
  "step": 405
617
  },
618
  {
619
- "epoch": 0.036444444444444446,
620
- "grad_norm": 3.5420918464660645,
621
- "learning_rate": 9.280208114573859e-05,
622
- "loss": 0.8276,
623
  "step": 410
624
  },
625
  {
626
- "epoch": 0.03688888888888889,
627
- "grad_norm": 3.725031852722168,
628
- "learning_rate": 9.251761124977815e-05,
629
- "loss": 0.8379,
630
  "step": 415
631
  },
632
  {
633
- "epoch": 0.037333333333333336,
634
- "grad_norm": 3.2828762531280518,
635
- "learning_rate": 9.222808218326784e-05,
636
- "loss": 0.8136,
637
  "step": 420
638
  },
639
  {
640
- "epoch": 0.03777777777777778,
641
- "grad_norm": 3.535404682159424,
642
- "learning_rate": 9.193352839727121e-05,
643
- "loss": 0.8441,
644
  "step": 425
645
  },
646
  {
647
- "epoch": 0.03822222222222222,
648
- "grad_norm": 3.6818301677703857,
649
- "learning_rate": 9.163398494074314e-05,
650
- "loss": 0.824,
651
  "step": 430
652
  },
653
  {
654
- "epoch": 0.03866666666666667,
655
- "grad_norm": 3.2186279296875,
656
- "learning_rate": 9.132948745635944e-05,
657
- "loss": 0.867,
658
  "step": 435
659
  },
660
  {
661
- "epoch": 0.03911111111111111,
662
- "grad_norm": 3.6566619873046875,
663
- "learning_rate": 9.102007217627568e-05,
664
- "loss": 0.8889,
665
  "step": 440
666
  },
667
  {
668
- "epoch": 0.03955555555555555,
669
- "grad_norm": 2.5457677841186523,
670
- "learning_rate": 9.070577591781597e-05,
671
- "loss": 0.8509,
672
  "step": 445
673
  },
674
  {
675
- "epoch": 0.04,
676
- "grad_norm": 2.946967840194702,
677
- "learning_rate": 9.038663607909198e-05,
678
- "loss": 0.8356,
679
  "step": 450
680
  },
681
  {
682
- "epoch": 0.04044444444444444,
683
- "grad_norm": 3.6755688190460205,
684
- "learning_rate": 9.006269063455304e-05,
685
- "loss": 0.8134,
686
  "step": 455
687
  },
688
  {
689
- "epoch": 0.04088888888888889,
690
- "grad_norm": 3.2024929523468018,
691
- "learning_rate": 8.97339781304675e-05,
692
- "loss": 0.8292,
693
  "step": 460
694
  },
695
  {
696
- "epoch": 0.04133333333333333,
697
- "grad_norm": 3.4355175495147705,
698
- "learning_rate": 8.940053768033609e-05,
699
- "loss": 0.8325,
700
  "step": 465
701
  },
702
  {
703
- "epoch": 0.041777777777777775,
704
- "grad_norm": 3.882667303085327,
705
- "learning_rate": 8.906240896023794e-05,
706
- "loss": 0.8773,
707
  "step": 470
708
  },
709
  {
710
- "epoch": 0.042222222222222223,
711
- "grad_norm": 3.5231196880340576,
712
- "learning_rate": 8.871963220410928e-05,
713
- "loss": 0.8399,
714
  "step": 475
715
  },
716
  {
717
- "epoch": 0.042666666666666665,
718
- "grad_norm": 2.3692946434020996,
719
- "learning_rate": 8.837224819895626e-05,
720
- "loss": 0.8638,
721
  "step": 480
722
  },
723
  {
724
- "epoch": 0.043111111111111114,
725
- "grad_norm": 3.1206417083740234,
726
- "learning_rate": 8.802029828000156e-05,
727
- "loss": 0.8241,
728
  "step": 485
729
  },
730
  {
731
- "epoch": 0.043555555555555556,
732
- "grad_norm": 2.570483922958374,
733
- "learning_rate": 8.766382432576588e-05,
734
- "loss": 0.8265,
735
  "step": 490
736
  },
737
  {
738
- "epoch": 0.044,
739
- "grad_norm": 2.721163749694824,
740
- "learning_rate": 8.730286875308497e-05,
741
- "loss": 0.8191,
742
  "step": 495
743
  },
744
  {
745
- "epoch": 0.044444444444444446,
746
- "grad_norm": 2.883211851119995,
747
- "learning_rate": 8.693747451206232e-05,
748
- "loss": 0.8014,
749
- "step": 500
750
- },
751
- {
752
- "epoch": 0.044444444444444446,
753
- "eval_loss": 0.9622268676757812,
754
- "eval_runtime": 146.4776,
755
- "eval_samples_per_second": 34.135,
756
- "eval_steps_per_second": 4.267,
757
  "step": 500
758
  },
759
  {
760
- "epoch": 0.04488888888888889,
761
- "grad_norm": 2.902592897415161,
762
- "learning_rate": 8.656768508095853e-05,
763
- "loss": 0.8482,
764
  "step": 505
765
  },
766
  {
767
- "epoch": 0.04533333333333334,
768
- "grad_norm": 3.207852840423584,
769
- "learning_rate": 8.61935444610179e-05,
770
- "loss": 0.819,
771
  "step": 510
772
  },
773
  {
774
- "epoch": 0.04577777777777778,
775
- "grad_norm": 3.402653455734253,
776
- "learning_rate": 8.581509717123273e-05,
777
- "loss": 0.8495,
778
  "step": 515
779
  },
780
  {
781
- "epoch": 0.04622222222222222,
782
- "grad_norm": 2.159984827041626,
783
- "learning_rate": 8.543238824304584e-05,
784
- "loss": 0.8078,
785
  "step": 520
786
  },
787
  {
788
- "epoch": 0.04666666666666667,
789
- "grad_norm": 3.279927968978882,
790
- "learning_rate": 8.504546321499255e-05,
791
- "loss": 0.8831,
792
  "step": 525
793
  },
794
  {
795
- "epoch": 0.04711111111111111,
796
- "grad_norm": 3.268341064453125,
797
- "learning_rate": 8.46543681272818e-05,
798
- "loss": 0.8359,
799
  "step": 530
800
  },
801
  {
802
- "epoch": 0.04755555555555555,
803
- "grad_norm": 2.2996602058410645,
804
- "learning_rate": 8.425914951631795e-05,
805
- "loss": 0.8419,
806
  "step": 535
807
  },
808
  {
809
- "epoch": 0.048,
810
- "grad_norm": 3.5043487548828125,
811
- "learning_rate": 8.385985440916344e-05,
812
- "loss": 0.8337,
813
  "step": 540
814
  },
815
  {
816
- "epoch": 0.04844444444444444,
817
- "grad_norm": 2.0510051250457764,
818
- "learning_rate": 8.345653031794292e-05,
819
- "loss": 0.8375,
820
  "step": 545
821
  },
822
  {
823
- "epoch": 0.04888888888888889,
824
- "grad_norm": 2.8395752906799316,
825
- "learning_rate": 8.304922523418987e-05,
826
- "loss": 0.82,
827
  "step": 550
828
  },
829
  {
830
- "epoch": 0.04933333333333333,
831
- "grad_norm": 4.7278876304626465,
832
- "learning_rate": 8.263798762313612e-05,
833
- "loss": 0.8209,
834
  "step": 555
835
  },
836
  {
837
- "epoch": 0.049777777777777775,
838
- "grad_norm": 2.782799482345581,
839
- "learning_rate": 8.222286641794488e-05,
840
- "loss": 0.8109,
841
  "step": 560
842
  },
843
  {
844
- "epoch": 0.050222222222222224,
845
- "grad_norm": 2.960604190826416,
846
- "learning_rate": 8.18039110138882e-05,
847
- "loss": 0.8397,
848
  "step": 565
849
  },
850
  {
851
- "epoch": 0.050666666666666665,
852
- "grad_norm": 2.1768970489501953,
853
- "learning_rate": 8.138117126246951e-05,
854
- "loss": 0.7785,
855
  "step": 570
856
  },
857
  {
858
- "epoch": 0.051111111111111114,
859
- "grad_norm": 2.2641615867614746,
860
- "learning_rate": 8.095469746549172e-05,
861
- "loss": 0.8549,
862
  "step": 575
863
  },
864
  {
865
- "epoch": 0.051555555555555556,
866
- "grad_norm": 3.175459384918213,
867
- "learning_rate": 8.052454036907174e-05,
868
- "loss": 0.8181,
869
  "step": 580
870
  },
871
  {
872
- "epoch": 0.052,
873
- "grad_norm": 4.72011661529541,
874
- "learning_rate": 8.009075115760241e-05,
875
- "loss": 0.8396,
876
  "step": 585
877
  },
878
  {
879
- "epoch": 0.052444444444444446,
880
- "grad_norm": 3.0918285846710205,
881
- "learning_rate": 7.965338144766186e-05,
882
- "loss": 0.8967,
883
  "step": 590
884
  },
885
  {
886
- "epoch": 0.05288888888888889,
887
- "grad_norm": 3.0895960330963135,
888
- "learning_rate": 7.921248328187173e-05,
889
- "loss": 0.8236,
890
  "step": 595
891
  },
892
  {
893
- "epoch": 0.05333333333333334,
894
- "grad_norm": 2.9176504611968994,
895
- "learning_rate": 7.876810912270462e-05,
896
- "loss": 0.7833,
897
  "step": 600
898
  },
899
  {
900
- "epoch": 0.05333333333333334,
901
- "eval_loss": 0.9369513392448425,
902
- "eval_runtime": 145.5512,
903
- "eval_samples_per_second": 34.352,
904
- "eval_steps_per_second": 4.294,
905
  "step": 600
906
  },
907
  {
908
- "epoch": 0.05377777777777778,
909
- "grad_norm": 2.640820264816284,
910
- "learning_rate": 7.832031184624164e-05,
911
- "loss": 0.7855,
912
  "step": 605
913
  },
914
  {
915
- "epoch": 0.05422222222222222,
916
- "grad_norm": 2.6097235679626465,
917
- "learning_rate": 7.786914473588056e-05,
918
- "loss": 0.8043,
919
  "step": 610
920
  },
921
  {
922
- "epoch": 0.05466666666666667,
923
- "grad_norm": 2.815849781036377,
924
- "learning_rate": 7.74146614759957e-05,
925
- "loss": 0.8535,
926
  "step": 615
927
  },
928
  {
929
- "epoch": 0.05511111111111111,
930
- "grad_norm": 3.133481025695801,
931
- "learning_rate": 7.695691614555003e-05,
932
- "loss": 0.8229,
933
  "step": 620
934
  },
935
  {
936
- "epoch": 0.05555555555555555,
937
- "grad_norm": 2.641892910003662,
938
- "learning_rate": 7.649596321166024e-05,
939
- "loss": 0.7967,
940
  "step": 625
941
  },
942
  {
943
- "epoch": 0.056,
944
- "grad_norm": 3.032099723815918,
945
- "learning_rate": 7.603185752311587e-05,
946
- "loss": 0.812,
947
  "step": 630
948
  },
949
  {
950
- "epoch": 0.05644444444444444,
951
- "grad_norm": 2.820112466812134,
952
- "learning_rate": 7.55646543038526e-05,
953
- "loss": 0.7977,
954
  "step": 635
955
  },
956
  {
957
- "epoch": 0.05688888888888889,
958
- "grad_norm": 3.10481333732605,
959
- "learning_rate": 7.509440914638139e-05,
960
- "loss": 0.8705,
961
  "step": 640
962
  },
963
  {
964
- "epoch": 0.05733333333333333,
965
- "grad_norm": 2.5827136039733887,
966
- "learning_rate": 7.462117800517336e-05,
967
- "loss": 0.815,
968
  "step": 645
969
  },
970
  {
971
- "epoch": 0.057777777777777775,
972
- "grad_norm": 3.8412251472473145,
973
- "learning_rate": 7.414501719000187e-05,
974
- "loss": 0.8164,
975
  "step": 650
976
  },
977
  {
978
- "epoch": 0.058222222222222224,
979
- "grad_norm": 2.3787269592285156,
980
- "learning_rate": 7.366598335924217e-05,
981
- "loss": 0.8154,
982
  "step": 655
983
  },
984
  {
985
- "epoch": 0.058666666666666666,
986
- "grad_norm": 2.156470537185669,
987
- "learning_rate": 7.318413351312965e-05,
988
- "loss": 0.8122,
989
  "step": 660
990
  },
991
  {
992
- "epoch": 0.059111111111111114,
993
- "grad_norm": 2.743718385696411,
994
- "learning_rate": 7.269952498697734e-05,
995
- "loss": 0.8279,
996
  "step": 665
997
  },
998
  {
999
- "epoch": 0.059555555555555556,
1000
- "grad_norm": 2.574324369430542,
1001
- "learning_rate": 7.221221544435363e-05,
1002
- "loss": 0.8205,
1003
  "step": 670
1004
  },
1005
  {
1006
- "epoch": 0.06,
1007
- "grad_norm": 2.6374778747558594,
1008
- "learning_rate": 7.172226287022086e-05,
1009
- "loss": 0.786,
1010
  "step": 675
1011
  },
1012
  {
1013
- "epoch": 0.060444444444444446,
1014
- "grad_norm": 2.6063718795776367,
1015
- "learning_rate": 7.122972556403567e-05,
1016
- "loss": 0.7742,
1017
  "step": 680
1018
  },
1019
  {
1020
- "epoch": 0.06088888888888889,
1021
- "grad_norm": 2.1271631717681885,
1022
- "learning_rate": 7.073466213281196e-05,
1023
- "loss": 0.8303,
1024
  "step": 685
1025
  },
1026
  {
1027
- "epoch": 0.06133333333333333,
1028
- "grad_norm": 2.25993275642395,
1029
- "learning_rate": 7.023713148414727e-05,
1030
- "loss": 0.8154,
1031
  "step": 690
1032
  },
1033
  {
1034
- "epoch": 0.06177777777777778,
1035
- "grad_norm": 2.227431058883667,
1036
- "learning_rate": 6.973719281921335e-05,
1037
- "loss": 0.8394,
1038
  "step": 695
1039
  },
1040
  {
1041
- "epoch": 0.06222222222222222,
1042
- "grad_norm": 2.3561928272247314,
1043
- "learning_rate": 6.923490562571181e-05,
1044
- "loss": 0.815,
1045
- "step": 700
1046
- },
1047
- {
1048
- "epoch": 0.06222222222222222,
1049
- "eval_loss": 0.9396146535873413,
1050
- "eval_runtime": 174.3916,
1051
- "eval_samples_per_second": 28.671,
1052
- "eval_steps_per_second": 3.584,
1053
  "step": 700
1054
  },
1055
  {
1056
- "epoch": 0.06266666666666666,
1057
- "grad_norm": 2.8203611373901367,
1058
- "learning_rate": 6.873032967079561e-05,
1059
- "loss": 0.8072,
1060
  "step": 705
1061
  },
1062
  {
1063
- "epoch": 0.06311111111111112,
1064
- "grad_norm": 2.616844892501831,
1065
- "learning_rate": 6.82235249939575e-05,
1066
- "loss": 0.8132,
1067
  "step": 710
1068
  },
1069
  {
1070
- "epoch": 0.06355555555555556,
1071
- "grad_norm": 2.7529284954071045,
1072
- "learning_rate": 6.771455189988579e-05,
1073
- "loss": 0.8126,
1074
  "step": 715
1075
  },
1076
  {
1077
- "epoch": 0.064,
1078
- "grad_norm": 2.466383218765259,
1079
- "learning_rate": 6.720347095128884e-05,
1080
- "loss": 0.8174,
1081
  "step": 720
1082
  },
1083
  {
1084
- "epoch": 0.06444444444444444,
1085
- "grad_norm": 2.2590644359588623,
1086
- "learning_rate": 6.669034296168855e-05,
1087
- "loss": 0.8065,
1088
  "step": 725
1089
  },
1090
  {
1091
- "epoch": 0.06488888888888888,
1092
- "grad_norm": 2.241419792175293,
1093
- "learning_rate": 6.617522898818426e-05,
1094
- "loss": 0.8332,
1095
  "step": 730
1096
  },
1097
  {
1098
- "epoch": 0.06533333333333333,
1099
- "grad_norm": 2.259533405303955,
1100
- "learning_rate": 6.565819032418747e-05,
1101
- "loss": 0.8599,
1102
  "step": 735
1103
  },
1104
  {
1105
- "epoch": 0.06577777777777778,
1106
- "grad_norm": 2.110358715057373,
1107
- "learning_rate": 6.513928849212873e-05,
1108
- "loss": 0.795,
1109
  "step": 740
1110
  },
1111
  {
1112
- "epoch": 0.06622222222222222,
1113
- "grad_norm": 2.656036615371704,
1114
- "learning_rate": 6.461858523613684e-05,
1115
- "loss": 0.8161,
1116
  "step": 745
1117
  },
1118
  {
1119
- "epoch": 0.06666666666666667,
1120
- "grad_norm": 3.0978121757507324,
1121
- "learning_rate": 6.409614251469208e-05,
1122
- "loss": 0.8104,
1123
  "step": 750
1124
  },
1125
  {
1126
- "epoch": 0.06711111111111111,
1127
- "grad_norm": 2.494825839996338,
1128
- "learning_rate": 6.357202249325371e-05,
1129
- "loss": 0.791,
1130
  "step": 755
1131
  },
1132
  {
1133
- "epoch": 0.06755555555555555,
1134
- "grad_norm": 2.344874143600464,
1135
- "learning_rate": 6.304628753686295e-05,
1136
- "loss": 0.8195,
1137
  "step": 760
1138
  },
1139
  {
1140
- "epoch": 0.068,
1141
- "grad_norm": 2.4682934284210205,
1142
- "learning_rate": 6.251900020272208e-05,
1143
- "loss": 0.7791,
1144
  "step": 765
1145
  },
1146
  {
1147
- "epoch": 0.06844444444444445,
1148
- "grad_norm": 2.29433012008667,
1149
- "learning_rate": 6.199022323275083e-05,
1150
- "loss": 0.8252,
1151
  "step": 770
1152
  },
1153
  {
1154
- "epoch": 0.06888888888888889,
1155
- "grad_norm": 1.965577483177185,
1156
- "learning_rate": 6.146001954612071e-05,
1157
- "loss": 0.8046,
1158
  "step": 775
1159
  },
1160
  {
1161
- "epoch": 0.06933333333333333,
1162
- "grad_norm": 2.1349830627441406,
1163
- "learning_rate": 6.092845223176823e-05,
1164
- "loss": 0.82,
1165
  "step": 780
1166
  },
1167
  {
1168
- "epoch": 0.06977777777777777,
1169
- "grad_norm": 2.2359840869903564,
1170
- "learning_rate": 6.0395584540887963e-05,
1171
- "loss": 0.8138,
1172
  "step": 785
1173
  },
1174
  {
1175
- "epoch": 0.07022222222222223,
1176
- "grad_norm": 2.470207691192627,
1177
- "learning_rate": 5.9861479879406315e-05,
1178
- "loss": 0.771,
1179
  "step": 790
1180
  },
1181
  {
1182
- "epoch": 0.07066666666666667,
1183
- "grad_norm": 1.9428234100341797,
1184
- "learning_rate": 5.932620180043674e-05,
1185
- "loss": 0.7997,
1186
  "step": 795
1187
  },
1188
  {
1189
- "epoch": 0.07111111111111111,
1190
- "grad_norm": 2.2809956073760986,
1191
- "learning_rate": 5.8789813996717736e-05,
1192
- "loss": 0.8113,
1193
  "step": 800
1194
  },
1195
  {
1196
- "epoch": 0.07111111111111111,
1197
- "eval_loss": 0.9512593746185303,
1198
- "eval_runtime": 156.0458,
1199
- "eval_samples_per_second": 32.042,
1200
- "eval_steps_per_second": 4.005,
1201
  "step": 800
1202
- },
1203
- {
1204
- "epoch": 0.07155555555555555,
1205
- "grad_norm": 2.4197585582733154,
1206
- "learning_rate": 5.8252380293033884e-05,
1207
- "loss": 0.8103,
1208
- "step": 805
1209
- },
1210
- {
1211
- "epoch": 0.072,
1212
- "grad_norm": 3.481379747390747,
1213
- "learning_rate": 5.7713964638621444e-05,
1214
- "loss": 0.8354,
1215
- "step": 810
1216
- },
1217
- {
1218
- "epoch": 0.07244444444444445,
1219
- "grad_norm": 3.0828964710235596,
1220
- "learning_rate": 5.717463109955896e-05,
1221
- "loss": 0.814,
1222
- "step": 815
1223
- },
1224
- {
1225
- "epoch": 0.07288888888888889,
1226
- "grad_norm": 2.0905721187591553,
1227
- "learning_rate": 5.663444385114411e-05,
1228
- "loss": 0.7695,
1229
- "step": 820
1230
- },
1231
- {
1232
- "epoch": 0.07333333333333333,
1233
- "grad_norm": 3.2763991355895996,
1234
- "learning_rate": 5.6093467170257374e-05,
1235
- "loss": 0.7864,
1236
- "step": 825
1237
- },
1238
- {
1239
- "epoch": 0.07377777777777778,
1240
- "grad_norm": 2.24617862701416,
1241
- "learning_rate": 5.5551765427713884e-05,
1242
- "loss": 0.7314,
1243
- "step": 830
1244
- },
1245
- {
1246
- "epoch": 0.07422222222222222,
1247
- "grad_norm": 2.808973789215088,
1248
- "learning_rate": 5.5009403080603815e-05,
1249
- "loss": 0.8163,
1250
- "step": 835
1251
- },
1252
- {
1253
- "epoch": 0.07466666666666667,
1254
- "grad_norm": 2.1906187534332275,
1255
- "learning_rate": 5.4466444664622685e-05,
1256
- "loss": 0.7868,
1257
- "step": 840
1258
- },
1259
- {
1260
- "epoch": 0.07511111111111111,
1261
- "grad_norm": 2.268329381942749,
1262
- "learning_rate": 5.392295478639225e-05,
1263
- "loss": 0.7765,
1264
- "step": 845
1265
- },
1266
- {
1267
- "epoch": 0.07555555555555556,
1268
- "grad_norm": 2.3029792308807373,
1269
- "learning_rate": 5.337899811577296e-05,
1270
- "loss": 0.7739,
1271
- "step": 850
1272
- },
1273
- {
1274
- "epoch": 0.076,
1275
- "grad_norm": 2.1580140590667725,
1276
- "learning_rate": 5.283463937816888e-05,
1277
- "loss": 0.7358,
1278
- "step": 855
1279
- },
1280
- {
1281
- "epoch": 0.07644444444444444,
1282
- "grad_norm": 1.9017610549926758,
1283
- "learning_rate": 5.228994334682604e-05,
1284
- "loss": 0.7553,
1285
- "step": 860
1286
- },
1287
- {
1288
- "epoch": 0.0768888888888889,
1289
- "grad_norm": 2.5367610454559326,
1290
- "learning_rate": 5.174497483512506e-05,
1291
- "loss": 0.7988,
1292
- "step": 865
1293
- },
1294
- {
1295
- "epoch": 0.07733333333333334,
1296
- "grad_norm": 2.2793335914611816,
1297
- "learning_rate": 5.119979868886895e-05,
1298
- "loss": 0.7736,
1299
- "step": 870
1300
- },
1301
- {
1302
- "epoch": 0.07777777777777778,
1303
- "grad_norm": 2.226646900177002,
1304
- "learning_rate": 5.0654479778567223e-05,
1305
- "loss": 0.7659,
1306
- "step": 875
1307
- },
1308
- {
1309
- "epoch": 0.07822222222222222,
1310
- "grad_norm": 2.1985228061676025,
1311
- "learning_rate": 5.010908299171685e-05,
1312
- "loss": 0.7584,
1313
- "step": 880
1314
- },
1315
- {
1316
- "epoch": 0.07866666666666666,
1317
- "grad_norm": 2.9187963008880615,
1318
- "learning_rate": 4.9563673225081314e-05,
1319
- "loss": 0.7747,
1320
- "step": 885
1321
- },
1322
- {
1323
- "epoch": 0.0791111111111111,
1324
- "grad_norm": 2.6130077838897705,
1325
- "learning_rate": 4.901831537696859e-05,
1326
- "loss": 0.7689,
1327
- "step": 890
1328
- },
1329
- {
1330
- "epoch": 0.07955555555555556,
1331
- "grad_norm": 2.0770986080169678,
1332
- "learning_rate": 4.8473074339508875e-05,
1333
- "loss": 0.7608,
1334
- "step": 895
1335
- },
1336
- {
1337
- "epoch": 0.08,
1338
- "grad_norm": 2.1071507930755615,
1339
- "learning_rate": 4.792801499093305e-05,
1340
- "loss": 0.7597,
1341
- "step": 900
1342
- },
1343
- {
1344
- "epoch": 0.08,
1345
- "eval_loss": 0.9028043746948242,
1346
- "eval_runtime": 151.7758,
1347
- "eval_samples_per_second": 32.943,
1348
- "eval_steps_per_second": 4.118,
1349
- "step": 900
1350
- },
1351
- {
1352
- "epoch": 0.08044444444444444,
1353
- "grad_norm": 2.295839309692383,
1354
- "learning_rate": 4.738320218785281e-05,
1355
- "loss": 0.7514,
1356
- "step": 905
1357
- },
1358
- {
1359
- "epoch": 0.08088888888888889,
1360
- "grad_norm": 1.9599803686141968,
1361
- "learning_rate": 4.683870075754347e-05,
1362
- "loss": 0.7633,
1363
- "step": 910
1364
- },
1365
- {
1366
- "epoch": 0.08133333333333333,
1367
- "grad_norm": 2.3032443523406982,
1368
- "learning_rate": 4.629457549023004e-05,
1369
- "loss": 0.7607,
1370
- "step": 915
1371
- },
1372
- {
1373
- "epoch": 0.08177777777777778,
1374
- "grad_norm": 2.9767608642578125,
1375
- "learning_rate": 4.575089113137792e-05,
1376
- "loss": 0.8124,
1377
- "step": 920
1378
- },
1379
- {
1380
- "epoch": 0.08222222222222222,
1381
- "grad_norm": 2.6544406414031982,
1382
- "learning_rate": 4.52077123739888e-05,
1383
- "loss": 0.799,
1384
- "step": 925
1385
- },
1386
- {
1387
- "epoch": 0.08266666666666667,
1388
- "grad_norm": 1.9514062404632568,
1389
- "learning_rate": 4.466510385090287e-05,
1390
- "loss": 0.7782,
1391
- "step": 930
1392
- },
1393
- {
1394
- "epoch": 0.08311111111111111,
1395
- "grad_norm": 2.3520166873931885,
1396
- "learning_rate": 4.412313012710813e-05,
1397
- "loss": 0.8328,
1398
- "step": 935
1399
- },
1400
- {
1401
- "epoch": 0.08355555555555555,
1402
- "grad_norm": 2.1971933841705322,
1403
- "learning_rate": 4.358185569205779e-05,
1404
- "loss": 0.7903,
1405
- "step": 940
1406
- },
1407
- {
1408
- "epoch": 0.084,
1409
- "grad_norm": 2.2344958782196045,
1410
- "learning_rate": 4.3041344951996746e-05,
1411
- "loss": 0.7288,
1412
- "step": 945
1413
- },
1414
- {
1415
- "epoch": 0.08444444444444445,
1416
- "grad_norm": 1.9887062311172485,
1417
- "learning_rate": 4.250166222229774e-05,
1418
- "loss": 0.7817,
1419
- "step": 950
1420
- },
1421
- {
1422
- "epoch": 0.08488888888888889,
1423
- "grad_norm": 2.5852599143981934,
1424
- "learning_rate": 4.196287171980869e-05,
1425
- "loss": 0.8126,
1426
- "step": 955
1427
- },
1428
- {
1429
- "epoch": 0.08533333333333333,
1430
- "grad_norm": 2.0287818908691406,
1431
- "learning_rate": 4.142503755521129e-05,
1432
- "loss": 0.8016,
1433
- "step": 960
1434
- },
1435
- {
1436
- "epoch": 0.08577777777777777,
1437
- "grad_norm": 2.318622589111328,
1438
- "learning_rate": 4.088822372539263e-05,
1439
- "loss": 0.7858,
1440
- "step": 965
1441
- },
1442
- {
1443
- "epoch": 0.08622222222222223,
1444
- "grad_norm": 1.7345527410507202,
1445
- "learning_rate": 4.035249410583016e-05,
1446
- "loss": 0.7737,
1447
- "step": 970
1448
- },
1449
- {
1450
- "epoch": 0.08666666666666667,
1451
- "grad_norm": 1.9019683599472046,
1452
- "learning_rate": 3.981791244299113e-05,
1453
- "loss": 0.7344,
1454
- "step": 975
1455
- },
1456
- {
1457
- "epoch": 0.08711111111111111,
1458
- "grad_norm": 2.2720935344696045,
1459
- "learning_rate": 3.928454234674747e-05,
1460
- "loss": 0.7723,
1461
- "step": 980
1462
- },
1463
- {
1464
- "epoch": 0.08755555555555555,
1465
- "grad_norm": 2.1315135955810547,
1466
- "learning_rate": 3.875244728280676e-05,
1467
- "loss": 0.799,
1468
- "step": 985
1469
- },
1470
- {
1471
- "epoch": 0.088,
1472
- "grad_norm": 2.0208346843719482,
1473
- "learning_rate": 3.82216905651605e-05,
1474
- "loss": 0.793,
1475
- "step": 990
1476
- },
1477
- {
1478
- "epoch": 0.08844444444444445,
1479
- "grad_norm": 2.7285315990448,
1480
- "learning_rate": 3.769233534855035e-05,
1481
- "loss": 0.7506,
1482
- "step": 995
1483
- },
1484
- {
1485
- "epoch": 0.08888888888888889,
1486
- "grad_norm": 2.095430374145508,
1487
- "learning_rate": 3.7164444620953396e-05,
1488
- "loss": 0.7534,
1489
- "step": 1000
1490
- },
1491
- {
1492
- "epoch": 0.08888888888888889,
1493
- "eval_loss": 0.8925400376319885,
1494
- "eval_runtime": 146.2458,
1495
- "eval_samples_per_second": 34.189,
1496
- "eval_steps_per_second": 4.274,
1497
- "step": 1000
1498
- },
1499
- {
1500
- "epoch": 0.08933333333333333,
1501
- "grad_norm": 2.029069423675537,
1502
- "learning_rate": 3.663808119608716e-05,
1503
- "loss": 0.792,
1504
- "step": 1005
1505
- },
1506
- {
1507
- "epoch": 0.08977777777777778,
1508
- "grad_norm": 2.4296746253967285,
1509
- "learning_rate": 3.6113307705935396e-05,
1510
- "loss": 0.7631,
1511
- "step": 1010
1512
- },
1513
- {
1514
- "epoch": 0.09022222222222222,
1515
- "grad_norm": 1.9055721759796143,
1516
- "learning_rate": 3.559018659329554e-05,
1517
- "loss": 0.764,
1518
- "step": 1015
1519
- },
1520
- {
1521
- "epoch": 0.09066666666666667,
1522
- "grad_norm": 1.9428242444992065,
1523
- "learning_rate": 3.506878010434863e-05,
1524
- "loss": 0.7671,
1525
- "step": 1020
1526
- },
1527
- {
1528
- "epoch": 0.09111111111111111,
1529
- "grad_norm": 2.55059552192688,
1530
- "learning_rate": 3.4549150281252636e-05,
1531
- "loss": 0.7873,
1532
- "step": 1025
1533
- },
1534
- {
1535
- "epoch": 0.09155555555555556,
1536
- "grad_norm": 1.9363492727279663,
1537
- "learning_rate": 3.403135895476004e-05,
1538
- "loss": 0.7592,
1539
- "step": 1030
1540
- },
1541
- {
1542
- "epoch": 0.092,
1543
- "grad_norm": 2.3893046379089355,
1544
- "learning_rate": 3.351546773686065e-05,
1545
- "loss": 0.7718,
1546
- "step": 1035
1547
- },
1548
- {
1549
- "epoch": 0.09244444444444444,
1550
- "grad_norm": 1.9245718717575073,
1551
- "learning_rate": 3.300153801345028e-05,
1552
- "loss": 0.7403,
1553
- "step": 1040
1554
- },
1555
- {
1556
- "epoch": 0.09288888888888888,
1557
- "grad_norm": 1.7629024982452393,
1558
- "learning_rate": 3.248963093702663e-05,
1559
- "loss": 0.7999,
1560
- "step": 1045
1561
- },
1562
- {
1563
- "epoch": 0.09333333333333334,
1564
- "grad_norm": 2.3090808391571045,
1565
- "learning_rate": 3.197980741941252e-05,
1566
- "loss": 0.7815,
1567
- "step": 1050
1568
- },
1569
- {
1570
- "epoch": 0.09377777777777778,
1571
- "grad_norm": 2.262960433959961,
1572
- "learning_rate": 3.147212812450819e-05,
1573
- "loss": 0.7737,
1574
- "step": 1055
1575
- },
1576
- {
1577
- "epoch": 0.09422222222222222,
1578
- "grad_norm": 2.4823575019836426,
1579
- "learning_rate": 3.096665346107278e-05,
1580
- "loss": 0.7961,
1581
- "step": 1060
1582
- },
1583
- {
1584
- "epoch": 0.09466666666666666,
1585
- "grad_norm": 2.411437511444092,
1586
- "learning_rate": 3.046344357553632e-05,
1587
- "loss": 0.8292,
1588
- "step": 1065
1589
- },
1590
- {
1591
- "epoch": 0.0951111111111111,
1592
- "grad_norm": 2.0698482990264893,
1593
- "learning_rate": 2.996255834484296e-05,
1594
- "loss": 0.7709,
1595
- "step": 1070
1596
- },
1597
- {
1598
- "epoch": 0.09555555555555556,
1599
- "grad_norm": 1.6237046718597412,
1600
- "learning_rate": 2.946405736932615e-05,
1601
- "loss": 0.7675,
1602
- "step": 1075
1603
- },
1604
- {
1605
- "epoch": 0.096,
1606
- "grad_norm": 2.6146557331085205,
1607
- "learning_rate": 2.8967999965616816e-05,
1608
- "loss": 0.7564,
1609
- "step": 1080
1610
- },
1611
- {
1612
- "epoch": 0.09644444444444444,
1613
- "grad_norm": 2.2040791511535645,
1614
- "learning_rate": 2.8474445159585235e-05,
1615
- "loss": 0.733,
1616
- "step": 1085
1617
- },
1618
- {
1619
- "epoch": 0.09688888888888889,
1620
- "grad_norm": 2.3044800758361816,
1621
- "learning_rate": 2.7983451679317706e-05,
1622
- "loss": 0.7705,
1623
- "step": 1090
1624
- },
1625
- {
1626
- "epoch": 0.09733333333333333,
1627
- "grad_norm": 2.10251784324646,
1628
- "learning_rate": 2.7495077948128245e-05,
1629
- "loss": 0.7545,
1630
- "step": 1095
1631
- },
1632
- {
1633
- "epoch": 0.09777777777777778,
1634
- "grad_norm": 2.353555202484131,
1635
- "learning_rate": 2.700938207760701e-05,
1636
- "loss": 0.7512,
1637
- "step": 1100
1638
- },
1639
- {
1640
- "epoch": 0.09777777777777778,
1641
- "eval_loss": 0.8821930885314941,
1642
- "eval_runtime": 144.8568,
1643
- "eval_samples_per_second": 34.517,
1644
- "eval_steps_per_second": 4.315,
1645
- "step": 1100
1646
- },
1647
- {
1648
- "epoch": 0.09822222222222222,
1649
- "grad_norm": 2.295103073120117,
1650
- "learning_rate": 2.6526421860705473e-05,
1651
- "loss": 0.7454,
1652
- "step": 1105
1653
- },
1654
- {
1655
- "epoch": 0.09866666666666667,
1656
- "grad_norm": 2.361027956008911,
1657
- "learning_rate": 2.6046254764859685e-05,
1658
- "loss": 0.6993,
1659
- "step": 1110
1660
- },
1661
- {
1662
- "epoch": 0.09911111111111111,
1663
- "grad_norm": 2.4085135459899902,
1664
- "learning_rate": 2.556893792515227e-05,
1665
- "loss": 0.7861,
1666
- "step": 1115
1667
- },
1668
- {
1669
- "epoch": 0.09955555555555555,
1670
- "grad_norm": 1.8174635171890259,
1671
- "learning_rate": 2.5094528137513795e-05,
1672
- "loss": 0.8115,
1673
- "step": 1120
1674
- },
1675
- {
1676
- "epoch": 0.1,
1677
- "grad_norm": 2.0099422931671143,
1678
- "learning_rate": 2.4623081851964806e-05,
1679
- "loss": 0.7719,
1680
- "step": 1125
1681
- },
1682
- {
1683
- "epoch": 0.10044444444444445,
1684
- "grad_norm": 2.152926445007324,
1685
- "learning_rate": 2.4154655165898627e-05,
1686
- "loss": 0.8149,
1687
- "step": 1130
1688
- },
1689
- {
1690
- "epoch": 0.10088888888888889,
1691
- "grad_norm": 1.8284573554992676,
1692
- "learning_rate": 2.3689303817406515e-05,
1693
- "loss": 0.7305,
1694
- "step": 1135
1695
- },
1696
- {
1697
- "epoch": 0.10133333333333333,
1698
- "grad_norm": 2.114602565765381,
1699
- "learning_rate": 2.3227083178645313e-05,
1700
- "loss": 0.7345,
1701
- "step": 1140
1702
- },
1703
- {
1704
- "epoch": 0.10177777777777777,
1705
- "grad_norm": 2.3404438495635986,
1706
- "learning_rate": 2.276804824924864e-05,
1707
- "loss": 0.7555,
1708
- "step": 1145
1709
- },
1710
- {
1711
- "epoch": 0.10222222222222223,
1712
- "grad_norm": 2.388821601867676,
1713
- "learning_rate": 2.2312253649782655e-05,
1714
- "loss": 0.7898,
1715
- "step": 1150
1716
- },
1717
- {
1718
- "epoch": 0.10266666666666667,
1719
- "grad_norm": 2.1446168422698975,
1720
- "learning_rate": 2.185975361524657e-05,
1721
- "loss": 0.7312,
1722
- "step": 1155
1723
- },
1724
- {
1725
- "epoch": 0.10311111111111111,
1726
- "grad_norm": 2.198666572570801,
1727
- "learning_rate": 2.1410601988619394e-05,
1728
- "loss": 0.7612,
1729
- "step": 1160
1730
- },
1731
- {
1732
- "epoch": 0.10355555555555555,
1733
- "grad_norm": 2.017157793045044,
1734
- "learning_rate": 2.0964852214453013e-05,
1735
- "loss": 0.7532,
1736
- "step": 1165
1737
- },
1738
- {
1739
- "epoch": 0.104,
1740
- "grad_norm": 1.7391427755355835,
1741
- "learning_rate": 2.0522557332512953e-05,
1742
- "loss": 0.7452,
1743
- "step": 1170
1744
- },
1745
- {
1746
- "epoch": 0.10444444444444445,
1747
- "grad_norm": 2.3819758892059326,
1748
- "learning_rate": 2.008376997146705e-05,
1749
- "loss": 0.7735,
1750
- "step": 1175
1751
- },
1752
- {
1753
- "epoch": 0.10488888888888889,
1754
- "grad_norm": 2.148484945297241,
1755
- "learning_rate": 1.9648542342623277e-05,
1756
- "loss": 0.7528,
1757
- "step": 1180
1758
- },
1759
- {
1760
- "epoch": 0.10533333333333333,
1761
- "grad_norm": 2.3765158653259277,
1762
- "learning_rate": 1.9216926233717085e-05,
1763
- "loss": 0.7266,
1764
- "step": 1185
1765
- },
1766
- {
1767
- "epoch": 0.10577777777777778,
1768
- "grad_norm": 2.245584487915039,
1769
- "learning_rate": 1.8788973002749105e-05,
1770
- "loss": 0.7631,
1771
- "step": 1190
1772
- },
1773
- {
1774
- "epoch": 0.10622222222222222,
1775
- "grad_norm": 2.7776741981506348,
1776
- "learning_rate": 1.83647335718742e-05,
1777
- "loss": 0.8013,
1778
- "step": 1195
1779
- },
1780
- {
1781
- "epoch": 0.10666666666666667,
1782
- "grad_norm": 2.3494765758514404,
1783
- "learning_rate": 1.7944258421342098e-05,
1784
- "loss": 0.7146,
1785
- "step": 1200
1786
- },
1787
- {
1788
- "epoch": 0.10666666666666667,
1789
- "eval_loss": 0.8797385692596436,
1790
- "eval_runtime": 152.824,
1791
- "eval_samples_per_second": 32.717,
1792
- "eval_steps_per_second": 4.09,
1793
- "step": 1200
1794
- },
1795
- {
1796
- "epoch": 0.10711111111111112,
1797
- "grad_norm": 2.1948468685150146,
1798
- "learning_rate": 1.7527597583490822e-05,
1799
- "loss": 0.7383,
1800
- "step": 1205
1801
- },
1802
- {
1803
- "epoch": 0.10755555555555556,
1804
- "grad_norm": 2.3977882862091064,
1805
- "learning_rate": 1.7114800636793377e-05,
1806
- "loss": 0.7751,
1807
- "step": 1210
1808
- },
1809
- {
1810
- "epoch": 0.108,
1811
- "grad_norm": 2.380903482437134,
1812
- "learning_rate": 1.670591669995829e-05,
1813
- "loss": 0.7514,
1814
- "step": 1215
1815
- },
1816
- {
1817
- "epoch": 0.10844444444444444,
1818
- "grad_norm": 2.2962698936462402,
1819
- "learning_rate": 1.6300994426085103e-05,
1820
- "loss": 0.7014,
1821
- "step": 1220
1822
- },
1823
- {
1824
- "epoch": 0.10888888888888888,
1825
- "grad_norm": 2.4068901538848877,
1826
- "learning_rate": 1.5900081996875083e-05,
1827
- "loss": 0.7087,
1828
- "step": 1225
1829
- },
1830
- {
1831
- "epoch": 0.10933333333333334,
1832
- "grad_norm": 1.881907343864441,
1833
- "learning_rate": 1.5503227116898016e-05,
1834
- "loss": 0.7847,
1835
- "step": 1230
1836
- },
1837
- {
1838
- "epoch": 0.10977777777777778,
1839
- "grad_norm": 1.9586148262023926,
1840
- "learning_rate": 1.5110477007916001e-05,
1841
- "loss": 0.7493,
1842
- "step": 1235
1843
- },
1844
- {
1845
- "epoch": 0.11022222222222222,
1846
- "grad_norm": 2.418649911880493,
1847
- "learning_rate": 1.4721878403264345e-05,
1848
- "loss": 0.7384,
1849
- "step": 1240
1850
- },
1851
- {
1852
- "epoch": 0.11066666666666666,
1853
- "grad_norm": 1.8795591592788696,
1854
- "learning_rate": 1.4337477542290928e-05,
1855
- "loss": 0.7254,
1856
- "step": 1245
1857
- },
1858
- {
1859
- "epoch": 0.1111111111111111,
1860
- "grad_norm": 1.8943523168563843,
1861
- "learning_rate": 1.3957320164854059e-05,
1862
- "loss": 0.7512,
1863
- "step": 1250
1864
- },
1865
- {
1866
- "epoch": 0.11155555555555556,
1867
- "grad_norm": 1.794277548789978,
1868
- "learning_rate": 1.3581451505879994e-05,
1869
- "loss": 0.7525,
1870
- "step": 1255
1871
- },
1872
- {
1873
- "epoch": 0.112,
1874
- "grad_norm": 2.044266700744629,
1875
- "learning_rate": 1.3209916289980334e-05,
1876
- "loss": 0.7377,
1877
- "step": 1260
1878
- },
1879
- {
1880
- "epoch": 0.11244444444444444,
1881
- "grad_norm": 2.0691747665405273,
1882
- "learning_rate": 1.2842758726130283e-05,
1883
- "loss": 0.7573,
1884
- "step": 1265
1885
- },
1886
- {
1887
- "epoch": 0.11288888888888889,
1888
- "grad_norm": 1.927995204925537,
1889
- "learning_rate": 1.2480022502408307e-05,
1890
- "loss": 0.7135,
1891
- "step": 1270
1892
- },
1893
- {
1894
- "epoch": 0.11333333333333333,
1895
- "grad_norm": 2.154827356338501,
1896
- "learning_rate": 1.2121750780797513e-05,
1897
- "loss": 0.7531,
1898
- "step": 1275
1899
- },
1900
- {
1901
- "epoch": 0.11377777777777778,
1902
- "grad_norm": 2.528263807296753,
1903
- "learning_rate": 1.1767986192049984e-05,
1904
- "loss": 0.7425,
1905
- "step": 1280
1906
- },
1907
- {
1908
- "epoch": 0.11422222222222222,
1909
- "grad_norm": 2.024723529815674,
1910
- "learning_rate": 1.1418770830614013e-05,
1911
- "loss": 0.7483,
1912
- "step": 1285
1913
- },
1914
- {
1915
- "epoch": 0.11466666666666667,
1916
- "grad_norm": 2.312997817993164,
1917
- "learning_rate": 1.1074146249625333e-05,
1918
- "loss": 0.7718,
1919
- "step": 1290
1920
- },
1921
- {
1922
- "epoch": 0.11511111111111111,
1923
- "grad_norm": 2.0183262825012207,
1924
- "learning_rate": 1.0734153455962765e-05,
1925
- "loss": 0.7252,
1926
- "step": 1295
1927
- },
1928
- {
1929
- "epoch": 0.11555555555555555,
1930
- "grad_norm": 1.9232224225997925,
1931
- "learning_rate": 1.0398832905368694e-05,
1932
- "loss": 0.7424,
1933
- "step": 1300
1934
- },
1935
- {
1936
- "epoch": 0.11555555555555555,
1937
- "eval_loss": 0.8784002065658569,
1938
- "eval_runtime": 150.2141,
1939
- "eval_samples_per_second": 33.286,
1940
- "eval_steps_per_second": 4.161,
1941
- "step": 1300
1942
- },
1943
- {
1944
- "epoch": 0.116,
1945
- "grad_norm": 2.266510486602783,
1946
- "learning_rate": 1.006822449763537e-05,
1947
- "loss": 0.7199,
1948
- "step": 1305
1949
- },
1950
- {
1951
- "epoch": 0.11644444444444445,
1952
- "grad_norm": 2.1905033588409424,
1953
- "learning_rate": 9.742367571857091e-06,
1954
- "loss": 0.6834,
1955
- "step": 1310
1956
- },
1957
- {
1958
- "epoch": 0.11688888888888889,
1959
- "grad_norm": 2.1627376079559326,
1960
- "learning_rate": 9.421300901749386e-06,
1961
- "loss": 0.7759,
1962
- "step": 1315
1963
- },
1964
- {
1965
- "epoch": 0.11733333333333333,
1966
- "grad_norm": 1.9716633558273315,
1967
- "learning_rate": 9.105062691035233e-06,
1968
- "loss": 0.7466,
1969
- "step": 1320
1970
- },
1971
- {
1972
- "epoch": 0.11777777777777777,
1973
- "grad_norm": 2.4646966457366943,
1974
- "learning_rate": 8.793690568899216e-06,
1975
- "loss": 0.7428,
1976
- "step": 1325
1977
- },
1978
- {
1979
- "epoch": 0.11822222222222223,
1980
- "grad_norm": 1.8310861587524414,
1981
- "learning_rate": 8.487221585510074e-06,
1982
- "loss": 0.7042,
1983
- "step": 1330
1984
- },
1985
- {
1986
- "epoch": 0.11866666666666667,
1987
- "grad_norm": 2.5189576148986816,
1988
- "learning_rate": 8.185692207612022e-06,
1989
- "loss": 0.7686,
1990
- "step": 1335
1991
- },
1992
- {
1993
- "epoch": 0.11911111111111111,
1994
- "grad_norm": 2.2453572750091553,
1995
- "learning_rate": 7.889138314185678e-06,
1996
- "loss": 0.7635,
1997
- "step": 1340
1998
- },
1999
- {
2000
- "epoch": 0.11955555555555555,
2001
- "grad_norm": 2.0272557735443115,
2002
- "learning_rate": 7.597595192178702e-06,
2003
- "loss": 0.7305,
2004
- "step": 1345
2005
- },
2006
- {
2007
- "epoch": 0.12,
2008
- "grad_norm": 1.907797932624817,
2009
- "learning_rate": 7.311097532307121e-06,
2010
- "loss": 0.7326,
2011
- "step": 1350
2012
- },
2013
- {
2014
- "epoch": 0.12044444444444445,
2015
- "grad_norm": 2.3352434635162354,
2016
- "learning_rate": 7.029679424927365e-06,
2017
- "loss": 0.7466,
2018
- "step": 1355
2019
- },
2020
- {
2021
- "epoch": 0.12088888888888889,
2022
- "grad_norm": 2.1825642585754395,
2023
- "learning_rate": 6.753374355979975e-06,
2024
- "loss": 0.7346,
2025
- "step": 1360
2026
- },
2027
- {
2028
- "epoch": 0.12133333333333333,
2029
- "grad_norm": 2.175647020339966,
2030
- "learning_rate": 6.482215203005015e-06,
2031
- "loss": 0.716,
2032
- "step": 1365
2033
- },
2034
- {
2035
- "epoch": 0.12177777777777778,
2036
- "grad_norm": 2.214008092880249,
2037
- "learning_rate": 6.216234231230012e-06,
2038
- "loss": 0.7253,
2039
- "step": 1370
2040
- },
2041
- {
2042
- "epoch": 0.12222222222222222,
2043
- "grad_norm": 1.9471856355667114,
2044
- "learning_rate": 5.955463089730723e-06,
2045
- "loss": 0.7655,
2046
- "step": 1375
2047
- },
2048
- {
2049
- "epoch": 0.12266666666666666,
2050
- "grad_norm": 1.9830577373504639,
2051
- "learning_rate": 5.699932807665198e-06,
2052
- "loss": 0.7597,
2053
- "step": 1380
2054
- },
2055
- {
2056
- "epoch": 0.12311111111111112,
2057
- "grad_norm": 2.7330000400543213,
2058
- "learning_rate": 5.449673790581611e-06,
2059
- "loss": 0.7379,
2060
- "step": 1385
2061
- },
2062
- {
2063
- "epoch": 0.12355555555555556,
2064
- "grad_norm": 1.8489813804626465,
2065
- "learning_rate": 5.204715816800343e-06,
2066
- "loss": 0.7257,
2067
- "step": 1390
2068
- },
2069
- {
2070
- "epoch": 0.124,
2071
- "grad_norm": 1.673449158668518,
2072
- "learning_rate": 4.965088033870608e-06,
2073
- "loss": 0.7137,
2074
- "step": 1395
2075
- },
2076
- {
2077
- "epoch": 0.12444444444444444,
2078
- "grad_norm": 1.8715451955795288,
2079
- "learning_rate": 4.730818955102234e-06,
2080
- "loss": 0.7182,
2081
- "step": 1400
2082
- },
2083
- {
2084
- "epoch": 0.12444444444444444,
2085
- "eval_loss": 0.8741394281387329,
2086
- "eval_runtime": 146.2188,
2087
- "eval_samples_per_second": 34.195,
2088
- "eval_steps_per_second": 4.274,
2089
- "step": 1400
2090
- },
2091
- {
2092
- "epoch": 0.12488888888888888,
2093
- "grad_norm": 2.205672264099121,
2094
- "learning_rate": 4.501936456172845e-06,
2095
- "loss": 0.677,
2096
- "step": 1405
2097
- },
2098
- {
2099
- "epoch": 0.12533333333333332,
2100
- "grad_norm": 1.7708081007003784,
2101
- "learning_rate": 4.278467771810896e-06,
2102
- "loss": 0.7472,
2103
- "step": 1410
2104
- },
2105
- {
2106
- "epoch": 0.12577777777777777,
2107
- "grad_norm": 1.9440569877624512,
2108
- "learning_rate": 4.06043949255509e-06,
2109
- "loss": 0.7535,
2110
- "step": 1415
2111
- },
2112
- {
2113
- "epoch": 0.12622222222222224,
2114
- "grad_norm": 2.4499361515045166,
2115
- "learning_rate": 3.847877561590296e-06,
2116
- "loss": 0.7376,
2117
- "step": 1420
2118
- },
2119
- {
2120
- "epoch": 0.12666666666666668,
2121
- "grad_norm": 1.9922362565994263,
2122
- "learning_rate": 3.6408072716606346e-06,
2123
- "loss": 0.7311,
2124
- "step": 1425
2125
- },
2126
- {
2127
- "epoch": 0.12711111111111112,
2128
- "grad_norm": 2.0958635807037354,
2129
- "learning_rate": 3.4392532620598216e-06,
2130
- "loss": 0.7467,
2131
- "step": 1430
2132
- },
2133
- {
2134
- "epoch": 0.12755555555555556,
2135
- "grad_norm": 2.4406979084014893,
2136
- "learning_rate": 3.24323951569942e-06,
2137
- "loss": 0.6949,
2138
- "step": 1435
2139
- },
2140
- {
2141
- "epoch": 0.128,
2142
- "grad_norm": 2.1745223999023438,
2143
- "learning_rate": 3.052789356255037e-06,
2144
- "loss": 0.7868,
2145
- "step": 1440
2146
- },
2147
- {
2148
- "epoch": 0.12844444444444444,
2149
- "grad_norm": 1.7190449237823486,
2150
- "learning_rate": 2.8679254453910785e-06,
2151
- "loss": 0.7544,
2152
- "step": 1445
2153
- },
2154
- {
2155
- "epoch": 0.1288888888888889,
2156
- "grad_norm": 2.2487220764160156,
2157
- "learning_rate": 2.688669780064268e-06,
2158
- "loss": 0.795,
2159
- "step": 1450
2160
- },
2161
- {
2162
- "epoch": 0.12933333333333333,
2163
- "grad_norm": 1.7264000177383423,
2164
- "learning_rate": 2.515043689906149e-06,
2165
- "loss": 0.7619,
2166
- "step": 1455
2167
- },
2168
- {
2169
- "epoch": 0.12977777777777777,
2170
- "grad_norm": 2.1828534603118896,
2171
- "learning_rate": 2.3470678346851518e-06,
2172
- "loss": 0.7232,
2173
- "step": 1460
2174
- },
2175
- {
2176
- "epoch": 0.1302222222222222,
2177
- "grad_norm": 2.1059188842773438,
2178
- "learning_rate": 2.1847622018482283e-06,
2179
- "loss": 0.7275,
2180
- "step": 1465
2181
- },
2182
- {
2183
- "epoch": 0.13066666666666665,
2184
- "grad_norm": 2.2545006275177,
2185
- "learning_rate": 2.0281461041425807e-06,
2186
- "loss": 0.7381,
2187
- "step": 1470
2188
- },
2189
- {
2190
- "epoch": 0.13111111111111112,
2191
- "grad_norm": 1.883216142654419,
2192
- "learning_rate": 1.8772381773176417e-06,
2193
- "loss": 0.713,
2194
- "step": 1475
2195
- },
2196
- {
2197
- "epoch": 0.13155555555555556,
2198
- "grad_norm": 1.9266149997711182,
2199
- "learning_rate": 1.7320563779075593e-06,
2200
- "loss": 0.7198,
2201
- "step": 1480
2202
- },
2203
- {
2204
- "epoch": 0.132,
2205
- "grad_norm": 2.2115542888641357,
2206
- "learning_rate": 1.5926179810946184e-06,
2207
- "loss": 0.7582,
2208
- "step": 1485
2209
- },
2210
- {
2211
- "epoch": 0.13244444444444445,
2212
- "grad_norm": 2.4320266246795654,
2213
- "learning_rate": 1.4589395786535953e-06,
2214
- "loss": 0.7429,
2215
- "step": 1490
2216
- },
2217
- {
2218
- "epoch": 0.1328888888888889,
2219
- "grad_norm": 2.242762804031372,
2220
- "learning_rate": 1.331037076977576e-06,
2221
- "loss": 0.7461,
2222
- "step": 1495
2223
- },
2224
- {
2225
- "epoch": 0.13333333333333333,
2226
- "grad_norm": 1.9152480363845825,
2227
- "learning_rate": 1.2089256951851924e-06,
2228
- "loss": 0.7538,
2229
- "step": 1500
2230
- },
2231
- {
2232
- "epoch": 0.13333333333333333,
2233
- "eval_loss": 0.8730303645133972,
2234
- "eval_runtime": 146.7239,
2235
- "eval_samples_per_second": 34.078,
2236
- "eval_steps_per_second": 4.26,
2237
- "step": 1500
2238
- },
2239
- {
2240
- "epoch": 0.13377777777777777,
2241
- "grad_norm": 2.0257728099823,
2242
- "learning_rate": 1.0926199633097157e-06,
2243
- "loss": 0.7371,
2244
- "step": 1505
2245
- },
2246
- {
2247
- "epoch": 0.13422222222222221,
2248
- "grad_norm": 2.0611376762390137,
2249
- "learning_rate": 9.821337205701665e-07,
2250
- "loss": 0.7441,
2251
- "step": 1510
2252
- },
2253
- {
2254
- "epoch": 0.13466666666666666,
2255
- "grad_norm": 2.425452470779419,
2256
- "learning_rate": 8.774801137245159e-07,
2257
- "loss": 0.7061,
2258
- "step": 1515
2259
- },
2260
- {
2261
- "epoch": 0.1351111111111111,
2262
- "grad_norm": 2.0856432914733887,
2263
- "learning_rate": 7.786715955054203e-07,
2264
- "loss": 0.7179,
2265
- "step": 1520
2266
- },
2267
- {
2268
- "epoch": 0.13555555555555557,
2269
- "grad_norm": 2.4328227043151855,
2270
- "learning_rate": 6.857199231384282e-07,
2271
- "loss": 0.7216,
2272
- "step": 1525
2273
- },
2274
- {
2275
- "epoch": 0.136,
2276
- "grad_norm": 1.9803858995437622,
2277
- "learning_rate": 5.986361569430165e-07,
2278
- "loss": 0.7653,
2279
- "step": 1530
2280
- },
2281
- {
2282
- "epoch": 0.13644444444444445,
2283
- "grad_norm": 1.987241268157959,
2284
- "learning_rate": 5.174306590164879e-07,
2285
- "loss": 0.7252,
2286
- "step": 1535
2287
- },
2288
- {
2289
- "epoch": 0.1368888888888889,
2290
- "grad_norm": 2.019005537033081,
2291
- "learning_rate": 4.4211309200102303e-07,
2292
- "loss": 0.7449,
2293
- "step": 1540
2294
- },
2295
- {
2296
- "epoch": 0.13733333333333334,
2297
- "grad_norm": 1.9359219074249268,
2298
- "learning_rate": 3.7269241793390085e-07,
2299
- "loss": 0.7122,
2300
- "step": 1545
2301
- },
2302
- {
2303
- "epoch": 0.13777777777777778,
2304
- "grad_norm": 2.6825265884399414,
2305
- "learning_rate": 3.09176897181096e-07,
2306
- "loss": 0.7412,
2307
- "step": 1550
2308
- },
2309
- {
2310
- "epoch": 0.13822222222222222,
2311
- "grad_norm": 2.0021426677703857,
2312
- "learning_rate": 2.515740874544148e-07,
2313
- "loss": 0.7252,
2314
- "step": 1555
2315
- },
2316
- {
2317
- "epoch": 0.13866666666666666,
2318
- "grad_norm": 1.840452790260315,
2319
- "learning_rate": 1.9989084291216487e-07,
2320
- "loss": 0.7414,
2321
- "step": 1560
2322
- },
2323
- {
2324
- "epoch": 0.1391111111111111,
2325
- "grad_norm": 1.8142333030700684,
2326
- "learning_rate": 1.5413331334360182e-07,
2327
- "loss": 0.7031,
2328
- "step": 1565
2329
- },
2330
- {
2331
- "epoch": 0.13955555555555554,
2332
- "grad_norm": 2.1795592308044434,
2333
- "learning_rate": 1.1430694343715353e-07,
2334
- "loss": 0.7334,
2335
- "step": 1570
2336
- },
2337
- {
2338
- "epoch": 0.14,
2339
- "grad_norm": 2.239292621612549,
2340
- "learning_rate": 8.041647213256064e-08,
2341
- "loss": 0.7233,
2342
- "step": 1575
2343
- },
2344
- {
2345
- "epoch": 0.14044444444444446,
2346
- "grad_norm": 2.0367276668548584,
2347
- "learning_rate": 5.246593205699424e-08,
2348
- "loss": 0.7542,
2349
- "step": 1580
2350
- },
2351
- {
2352
- "epoch": 0.1408888888888889,
2353
- "grad_norm": 2.47322678565979,
2354
- "learning_rate": 3.04586490452119e-08,
2355
- "loss": 0.7508,
2356
- "step": 1585
2357
- },
2358
- {
2359
- "epoch": 0.14133333333333334,
2360
- "grad_norm": 2.6402623653411865,
2361
- "learning_rate": 1.4397241743813184e-08,
2362
- "loss": 0.7378,
2363
- "step": 1590
2364
- },
2365
- {
2366
- "epoch": 0.14177777777777778,
2367
- "grad_norm": 2.3272688388824463,
2368
- "learning_rate": 4.2836212996499865e-09,
2369
- "loss": 0.7607,
2370
- "step": 1595
2371
- },
2372
- {
2373
- "epoch": 0.14222222222222222,
2374
- "grad_norm": 2.066089630126953,
2375
- "learning_rate": 1.189911324084303e-10,
2376
- "loss": 0.7017,
2377
- "step": 1600
2378
- },
2379
- {
2380
- "epoch": 0.14222222222222222,
2381
- "eval_loss": 0.8723308444023132,
2382
- "eval_runtime": 146.0052,
2383
- "eval_samples_per_second": 34.245,
2384
- "eval_steps_per_second": 4.281,
2385
- "step": 1600
2386
  }
2387
  ],
2388
  "logging_steps": 5,
2389
- "max_steps": 1600,
2390
  "num_input_tokens_seen": 0,
2391
  "num_train_epochs": 1,
2392
- "save_steps": 100,
2393
  "stateful_callbacks": {
2394
  "TrainerControl": {
2395
  "args": {
@@ -2397,13 +1181,13 @@
2397
  "should_evaluate": false,
2398
  "should_log": false,
2399
  "should_save": true,
2400
- "should_training_stop": true
2401
  },
2402
  "attributes": {}
2403
  }
2404
  },
2405
- "total_flos": 347760947673600.0,
2406
- "train_batch_size": 1,
2407
  "trial_name": null,
2408
  "trial_params": null
2409
  }
 
1
  {
2
+ "best_global_step": 800,
3
+ "best_metric": 0.7418414950370789,
4
+ "best_model_checkpoint": "checkpoints/lora_uci/checkpoint-800",
5
+ "epoch": 0.28444444444444444,
6
+ "eval_steps": 200,
7
+ "global_step": 800,
8
  "is_hyper_param_search": false,
9
  "is_local_process_zero": true,
10
  "is_world_process_zero": true,
11
  "log_history": [
12
  {
13
+ "epoch": 0.00035555555555555557,
14
+ "grad_norm": 32.15512466430664,
15
  "learning_rate": 0.0,
16
+ "loss": 4.3308,
17
  "step": 1
18
  },
19
  {
20
+ "epoch": 0.0017777777777777779,
21
+ "grad_norm": 29.14179801940918,
22
+ "learning_rate": 1.6666666666666667e-06,
23
+ "loss": 4.2083,
24
  "step": 5
25
  },
26
  {
27
+ "epoch": 0.0035555555555555557,
28
+ "grad_norm": 22.27640724182129,
29
+ "learning_rate": 3.75e-06,
30
+ "loss": 3.5173,
31
  "step": 10
32
  },
33
  {
34
+ "epoch": 0.005333333333333333,
35
+ "grad_norm": 16.255199432373047,
36
+ "learning_rate": 5.833333333333334e-06,
37
+ "loss": 2.492,
38
  "step": 15
39
  },
40
  {
41
+ "epoch": 0.0071111111111111115,
42
+ "grad_norm": 13.185359954833984,
43
+ "learning_rate": 7.916666666666667e-06,
44
+ "loss": 1.6992,
45
  "step": 20
46
  },
47
  {
48
+ "epoch": 0.008888888888888889,
49
+ "grad_norm": 10.229533195495605,
50
+ "learning_rate": 1e-05,
51
+ "loss": 1.3879,
52
  "step": 25
53
  },
54
  {
55
+ "epoch": 0.010666666666666666,
56
+ "grad_norm": 12.396824836730957,
57
+ "learning_rate": 1.2083333333333333e-05,
58
+ "loss": 1.2205,
59
  "step": 30
60
  },
61
  {
62
+ "epoch": 0.012444444444444444,
63
+ "grad_norm": 34.271759033203125,
64
+ "learning_rate": 1.4166666666666668e-05,
65
+ "loss": 1.12,
66
  "step": 35
67
  },
68
  {
69
+ "epoch": 0.014222222222222223,
70
+ "grad_norm": 16.109539031982422,
71
+ "learning_rate": 1.6250000000000002e-05,
72
+ "loss": 1.1309,
73
  "step": 40
74
  },
75
  {
76
+ "epoch": 0.016,
77
+ "grad_norm": 14.91402816772461,
78
+ "learning_rate": 1.8333333333333333e-05,
79
+ "loss": 1.1121,
80
  "step": 45
81
  },
82
  {
83
+ "epoch": 0.017777777777777778,
84
+ "grad_norm": 11.38852310180664,
85
+ "learning_rate": 2.0416666666666667e-05,
86
+ "loss": 1.0329,
87
  "step": 50
88
  },
89
  {
90
+ "epoch": 0.019555555555555555,
91
+ "grad_norm": 13.237730979919434,
92
+ "learning_rate": 2.25e-05,
93
+ "loss": 1.0185,
94
  "step": 55
95
  },
96
  {
97
+ "epoch": 0.021333333333333333,
98
+ "grad_norm": 29.64104461669922,
99
+ "learning_rate": 2.4583333333333332e-05,
100
+ "loss": 1.012,
101
  "step": 60
102
  },
103
  {
104
+ "epoch": 0.02311111111111111,
105
+ "grad_norm": 9.474141120910645,
106
+ "learning_rate": 2.6666666666666667e-05,
107
+ "loss": 0.9874,
108
  "step": 65
109
  },
110
  {
111
+ "epoch": 0.024888888888888887,
112
+ "grad_norm": 8.440736770629883,
113
+ "learning_rate": 2.8749999999999997e-05,
114
+ "loss": 0.9604,
115
  "step": 70
116
  },
117
  {
118
+ "epoch": 0.02666666666666667,
119
+ "grad_norm": 6.6439900398254395,
120
+ "learning_rate": 3.0833333333333335e-05,
121
+ "loss": 0.9503,
122
  "step": 75
123
  },
124
  {
125
+ "epoch": 0.028444444444444446,
126
+ "grad_norm": 7.365396499633789,
127
+ "learning_rate": 3.291666666666667e-05,
128
+ "loss": 0.9339,
129
  "step": 80
130
  },
131
  {
132
+ "epoch": 0.030222222222222223,
133
+ "grad_norm": 8.67831802368164,
134
+ "learning_rate": 3.5e-05,
135
+ "loss": 0.919,
136
  "step": 85
137
  },
138
  {
139
+ "epoch": 0.032,
140
+ "grad_norm": 9.373591423034668,
141
+ "learning_rate": 3.708333333333334e-05,
142
+ "loss": 0.9177,
143
  "step": 90
144
  },
145
  {
146
+ "epoch": 0.033777777777777775,
147
+ "grad_norm": 6.998920440673828,
148
+ "learning_rate": 3.9166666666666665e-05,
149
+ "loss": 0.9053,
150
  "step": 95
151
  },
152
  {
153
+ "epoch": 0.035555555555555556,
154
+ "grad_norm": 7.322479248046875,
155
+ "learning_rate": 4.125e-05,
156
+ "loss": 0.9217,
157
  "step": 100
158
  },
159
  {
160
+ "epoch": 0.037333333333333336,
161
+ "grad_norm": 8.45313549041748,
162
+ "learning_rate": 4.3333333333333334e-05,
163
+ "loss": 0.9077,
164
  "step": 105
165
  },
166
  {
167
+ "epoch": 0.03911111111111111,
168
+ "grad_norm": 10.838536262512207,
169
+ "learning_rate": 4.541666666666667e-05,
170
+ "loss": 0.9066,
171
  "step": 110
172
  },
173
  {
174
+ "epoch": 0.04088888888888889,
175
+ "grad_norm": 9.282814979553223,
176
+ "learning_rate": 4.75e-05,
177
+ "loss": 0.8939,
178
  "step": 115
179
  },
180
  {
181
+ "epoch": 0.042666666666666665,
182
+ "grad_norm": 5.256754398345947,
183
+ "learning_rate": 4.958333333333334e-05,
184
+ "loss": 0.8755,
185
  "step": 120
186
  },
187
  {
188
+ "epoch": 0.044444444444444446,
189
+ "grad_norm": 6.6123552322387695,
190
+ "learning_rate": 5.166666666666667e-05,
191
+ "loss": 0.8843,
192
  "step": 125
193
  },
194
  {
195
+ "epoch": 0.04622222222222222,
196
+ "grad_norm": 6.317594528198242,
197
+ "learning_rate": 5.375e-05,
198
+ "loss": 0.8802,
199
  "step": 130
200
  },
201
  {
202
+ "epoch": 0.048,
203
+ "grad_norm": 4.420418739318848,
204
+ "learning_rate": 5.583333333333334e-05,
205
+ "loss": 0.8716,
206
  "step": 135
207
  },
208
  {
209
+ "epoch": 0.049777777777777775,
210
+ "grad_norm": 7.11093282699585,
211
+ "learning_rate": 5.7916666666666674e-05,
212
+ "loss": 0.8944,
213
  "step": 140
214
  },
215
  {
216
+ "epoch": 0.051555555555555556,
217
+ "grad_norm": 8.643278121948242,
218
+ "learning_rate": 6e-05,
219
+ "loss": 0.9049,
220
  "step": 145
221
  },
222
  {
223
+ "epoch": 0.05333333333333334,
224
+ "grad_norm": 5.504462718963623,
225
+ "learning_rate": 6.208333333333334e-05,
226
+ "loss": 0.9215,
227
  "step": 150
228
  },
229
  {
230
+ "epoch": 0.05511111111111111,
231
+ "grad_norm": 4.5625200271606445,
232
+ "learning_rate": 6.416666666666668e-05,
233
+ "loss": 0.8763,
234
  "step": 155
235
  },
236
  {
237
+ "epoch": 0.05688888888888889,
238
+ "grad_norm": 4.5830397605896,
239
+ "learning_rate": 6.625e-05,
240
+ "loss": 0.8967,
241
  "step": 160
242
  },
243
  {
244
+ "epoch": 0.058666666666666666,
245
+ "grad_norm": 5.370687961578369,
246
+ "learning_rate": 6.833333333333333e-05,
247
+ "loss": 0.868,
248
  "step": 165
249
  },
250
  {
251
+ "epoch": 0.060444444444444446,
252
+ "grad_norm": 8.188835144042969,
253
+ "learning_rate": 7.041666666666668e-05,
254
+ "loss": 0.8853,
255
  "step": 170
256
  },
257
  {
258
+ "epoch": 0.06222222222222222,
259
+ "grad_norm": 3.952087163925171,
260
+ "learning_rate": 7.25e-05,
261
+ "loss": 0.8724,
262
  "step": 175
263
  },
264
  {
265
+ "epoch": 0.064,
266
+ "grad_norm": 4.194353103637695,
267
+ "learning_rate": 7.458333333333333e-05,
268
+ "loss": 0.8581,
269
  "step": 180
270
  },
271
  {
272
+ "epoch": 0.06577777777777778,
273
+ "grad_norm": 2.985386610031128,
274
+ "learning_rate": 7.666666666666667e-05,
275
+ "loss": 0.8496,
276
  "step": 185
277
  },
278
  {
279
+ "epoch": 0.06755555555555555,
280
+ "grad_norm": 5.666004657745361,
281
+ "learning_rate": 7.875e-05,
282
+ "loss": 0.8816,
283
  "step": 190
284
  },
285
  {
286
+ "epoch": 0.06933333333333333,
287
+ "grad_norm": 3.95521879196167,
288
+ "learning_rate": 8.083333333333334e-05,
289
+ "loss": 0.8872,
290
  "step": 195
291
  },
292
  {
293
+ "epoch": 0.07111111111111111,
294
+ "grad_norm": 4.558910369873047,
295
+ "learning_rate": 8.291666666666667e-05,
296
+ "loss": 0.8802,
297
  "step": 200
298
  },
299
  {
300
+ "epoch": 0.07111111111111111,
301
+ "eval_loss": 0.8564087748527527,
302
+ "eval_runtime": 155.7786,
303
+ "eval_samples_per_second": 32.097,
304
+ "eval_steps_per_second": 4.012,
305
  "step": 200
306
  },
307
  {
308
+ "epoch": 0.07288888888888889,
309
+ "grad_norm": 2.4701411724090576,
310
+ "learning_rate": 8.5e-05,
311
+ "loss": 0.8383,
312
  "step": 205
313
  },
314
  {
315
+ "epoch": 0.07466666666666667,
316
+ "grad_norm": 4.364571571350098,
317
+ "learning_rate": 8.708333333333334e-05,
318
+ "loss": 0.8763,
319
  "step": 210
320
  },
321
  {
322
+ "epoch": 0.07644444444444444,
323
+ "grad_norm": 4.059802532196045,
324
+ "learning_rate": 8.916666666666667e-05,
325
+ "loss": 0.8928,
326
  "step": 215
327
  },
328
  {
329
+ "epoch": 0.07822222222222222,
330
+ "grad_norm": 7.405764579772949,
331
+ "learning_rate": 9.125e-05,
332
+ "loss": 0.8619,
333
  "step": 220
334
  },
335
  {
336
+ "epoch": 0.08,
337
+ "grad_norm": 4.007632732391357,
338
+ "learning_rate": 9.333333333333334e-05,
339
+ "loss": 0.9656,
340
  "step": 225
341
  },
342
  {
343
+ "epoch": 0.08177777777777778,
344
+ "grad_norm": 6.396026611328125,
345
+ "learning_rate": 9.541666666666668e-05,
346
+ "loss": 0.9084,
347
  "step": 230
348
  },
349
  {
350
+ "epoch": 0.08355555555555555,
351
+ "grad_norm": 4.630360126495361,
352
+ "learning_rate": 9.75e-05,
353
+ "loss": 0.8617,
354
  "step": 235
355
  },
356
  {
357
+ "epoch": 0.08533333333333333,
358
+ "grad_norm": 2.987304925918579,
359
+ "learning_rate": 9.958333333333335e-05,
360
+ "loss": 0.8696,
361
  "step": 240
362
  },
363
  {
364
+ "epoch": 0.08711111111111111,
365
+ "grad_norm": 3.981341600418091,
366
+ "learning_rate": 9.999915384288722e-05,
367
+ "loss": 0.8412,
368
  "step": 245
369
  },
370
  {
371
+ "epoch": 0.08888888888888889,
372
+ "grad_norm": 2.754917860031128,
373
+ "learning_rate": 9.999571637870036e-05,
374
+ "loss": 0.8526,
375
  "step": 250
376
  },
377
  {
378
+ "epoch": 0.09066666666666667,
379
+ "grad_norm": 2.6841213703155518,
380
+ "learning_rate": 9.998963490426943e-05,
381
+ "loss": 0.853,
382
  "step": 255
383
  },
384
  {
385
+ "epoch": 0.09244444444444444,
386
+ "grad_norm": 3.0342020988464355,
387
+ "learning_rate": 9.998090974121159e-05,
388
+ "loss": 0.8551,
389
  "step": 260
390
  },
391
  {
392
+ "epoch": 0.09422222222222222,
393
+ "grad_norm": 3.418090343475342,
394
+ "learning_rate": 9.99695413509548e-05,
395
+ "loss": 0.8315,
396
  "step": 265
397
  },
398
  {
399
+ "epoch": 0.096,
400
+ "grad_norm": 2.6049306392669678,
401
+ "learning_rate": 9.995553033471335e-05,
402
+ "loss": 0.8239,
403
  "step": 270
404
  },
405
  {
406
+ "epoch": 0.09777777777777778,
407
+ "grad_norm": 4.258927345275879,
408
+ "learning_rate": 9.993887743345614e-05,
409
+ "loss": 0.84,
410
  "step": 275
411
  },
412
  {
413
+ "epoch": 0.09955555555555555,
414
+ "grad_norm": 3.5856103897094727,
415
+ "learning_rate": 9.991958352786744e-05,
416
+ "loss": 0.8397,
417
  "step": 280
418
  },
419
  {
420
+ "epoch": 0.10133333333333333,
421
+ "grad_norm": 4.18107271194458,
422
+ "learning_rate": 9.989764963830037e-05,
423
+ "loss": 0.8283,
424
  "step": 285
425
  },
426
  {
427
+ "epoch": 0.10311111111111111,
428
+ "grad_norm": 2.9539637565612793,
429
+ "learning_rate": 9.987307692472287e-05,
430
+ "loss": 0.8315,
431
  "step": 290
432
  },
433
  {
434
+ "epoch": 0.10488888888888889,
435
+ "grad_norm": 2.6121134757995605,
436
+ "learning_rate": 9.98458666866564e-05,
437
+ "loss": 0.8203,
438
  "step": 295
439
  },
440
  {
441
+ "epoch": 0.10666666666666667,
442
+ "grad_norm": 3.4740283489227295,
443
+ "learning_rate": 9.98160203631072e-05,
444
+ "loss": 0.83,
445
  "step": 300
446
  },
447
  {
448
+ "epoch": 0.10844444444444444,
449
+ "grad_norm": 3.486816167831421,
450
+ "learning_rate": 9.978353953249022e-05,
451
+ "loss": 0.8269,
452
  "step": 305
453
  },
454
  {
455
+ "epoch": 0.11022222222222222,
456
+ "grad_norm": 2.7455246448516846,
457
+ "learning_rate": 9.974842591254558e-05,
458
+ "loss": 0.8332,
459
  "step": 310
460
  },
461
  {
462
+ "epoch": 0.112,
463
+ "grad_norm": 2.8629767894744873,
464
+ "learning_rate": 9.971068136024781e-05,
465
+ "loss": 0.8305,
466
  "step": 315
467
  },
468
  {
469
+ "epoch": 0.11377777777777778,
470
+ "grad_norm": 2.647754192352295,
471
+ "learning_rate": 9.967030787170757e-05,
472
+ "loss": 0.8213,
473
  "step": 320
474
  },
475
  {
476
+ "epoch": 0.11555555555555555,
477
+ "grad_norm": 2.873353958129883,
478
+ "learning_rate": 9.962730758206611e-05,
479
+ "loss": 0.8269,
480
  "step": 325
481
  },
482
  {
483
+ "epoch": 0.11733333333333333,
484
+ "grad_norm": 1.9501383304595947,
485
+ "learning_rate": 9.95816827653824e-05,
486
+ "loss": 0.8001,
487
  "step": 330
488
  },
489
  {
490
+ "epoch": 0.11911111111111111,
491
+ "grad_norm": 2.3588831424713135,
492
+ "learning_rate": 9.95334358345128e-05,
493
+ "loss": 0.7965,
494
  "step": 335
495
  },
496
  {
497
+ "epoch": 0.12088888888888889,
498
+ "grad_norm": 1.9669915437698364,
499
+ "learning_rate": 9.948256934098352e-05,
500
+ "loss": 0.7949,
501
  "step": 340
502
  },
503
  {
504
+ "epoch": 0.12266666666666666,
505
+ "grad_norm": 2.3287253379821777,
506
+ "learning_rate": 9.942908597485558e-05,
507
+ "loss": 0.8312,
508
  "step": 345
509
  },
510
  {
511
+ "epoch": 0.12444444444444444,
512
+ "grad_norm": 3.263697385787964,
513
+ "learning_rate": 9.93729885645827e-05,
514
+ "loss": 0.8305,
515
  "step": 350
516
  },
517
  {
518
+ "epoch": 0.12622222222222224,
519
+ "grad_norm": 1.87248694896698,
520
+ "learning_rate": 9.931428007686158e-05,
521
+ "loss": 0.8292,
522
  "step": 355
523
  },
524
  {
525
+ "epoch": 0.128,
526
+ "grad_norm": 2.7504541873931885,
527
+ "learning_rate": 9.925296361647504e-05,
528
+ "loss": 0.8285,
529
  "step": 360
530
  },
531
  {
532
+ "epoch": 0.12977777777777777,
533
+ "grad_norm": 2.169858694076538,
534
+ "learning_rate": 9.918904242612795e-05,
535
+ "loss": 0.8166,
536
  "step": 365
537
  },
538
  {
539
+ "epoch": 0.13155555555555556,
540
+ "grad_norm": 2.2024645805358887,
541
+ "learning_rate": 9.912251988627549e-05,
542
+ "loss": 0.7927,
543
  "step": 370
544
  },
545
  {
546
+ "epoch": 0.13333333333333333,
547
+ "grad_norm": 2.103611946105957,
548
+ "learning_rate": 9.905339951494463e-05,
549
+ "loss": 0.8236,
550
  "step": 375
551
  },
552
  {
553
+ "epoch": 0.1351111111111111,
554
+ "grad_norm": 2.265293836593628,
555
+ "learning_rate": 9.898168496754794e-05,
556
+ "loss": 0.7926,
557
  "step": 380
558
  },
559
  {
560
+ "epoch": 0.1368888888888889,
561
+ "grad_norm": 1.8098556995391846,
562
+ "learning_rate": 9.890738003669029e-05,
563
+ "loss": 0.812,
564
  "step": 385
565
  },
566
  {
567
+ "epoch": 0.13866666666666666,
568
+ "grad_norm": 1.6579197645187378,
569
+ "learning_rate": 9.88304886519683e-05,
570
+ "loss": 0.7933,
571
  "step": 390
572
  },
573
  {
574
+ "epoch": 0.14044444444444446,
575
+ "grad_norm": 1.664461612701416,
576
+ "learning_rate": 9.875101487976253e-05,
577
+ "loss": 0.798,
578
  "step": 395
579
  },
580
  {
581
+ "epoch": 0.14222222222222222,
582
+ "grad_norm": 1.6052963733673096,
583
+ "learning_rate": 9.866896292302243e-05,
584
+ "loss": 0.7937,
585
  "step": 400
586
  },
587
  {
588
+ "epoch": 0.14222222222222222,
589
+ "eval_loss": 0.791572093963623,
590
+ "eval_runtime": 159.2149,
591
+ "eval_samples_per_second": 31.404,
592
+ "eval_steps_per_second": 3.926,
593
  "step": 400
594
  },
595
  {
596
+ "epoch": 0.144,
597
+ "grad_norm": 2.126084566116333,
598
+ "learning_rate": 9.858433712104403e-05,
599
+ "loss": 0.8188,
600
  "step": 405
601
  },
602
  {
603
+ "epoch": 0.14577777777777778,
604
+ "grad_norm": 3.2941622734069824,
605
+ "learning_rate": 9.849714194924046e-05,
606
+ "loss": 0.8067,
607
  "step": 410
608
  },
609
  {
610
+ "epoch": 0.14755555555555555,
611
+ "grad_norm": 1.658234715461731,
612
+ "learning_rate": 9.84073820189054e-05,
613
+ "loss": 0.7953,
614
  "step": 415
615
  },
616
  {
617
+ "epoch": 0.14933333333333335,
618
+ "grad_norm": 2.6132164001464844,
619
+ "learning_rate": 9.831506207696898e-05,
620
+ "loss": 0.8044,
621
  "step": 420
622
  },
623
  {
624
+ "epoch": 0.1511111111111111,
625
+ "grad_norm": 1.6197243928909302,
626
+ "learning_rate": 9.822018700574695e-05,
627
+ "loss": 0.7818,
628
  "step": 425
629
  },
630
  {
631
+ "epoch": 0.15288888888888888,
632
+ "grad_norm": 2.1293976306915283,
633
+ "learning_rate": 9.812276182268236e-05,
634
+ "loss": 0.7796,
635
  "step": 430
636
  },
637
  {
638
+ "epoch": 0.15466666666666667,
639
+ "grad_norm": 2.590989589691162,
640
+ "learning_rate": 9.802279168008029e-05,
641
+ "loss": 0.7903,
642
  "step": 435
643
  },
644
  {
645
+ "epoch": 0.15644444444444444,
646
+ "grad_norm": 1.674521803855896,
647
+ "learning_rate": 9.792028186483526e-05,
648
+ "loss": 0.7772,
649
  "step": 440
650
  },
651
  {
652
+ "epoch": 0.1582222222222222,
653
+ "grad_norm": 2.3836069107055664,
654
+ "learning_rate": 9.781523779815179e-05,
655
+ "loss": 0.7934,
656
  "step": 445
657
  },
658
  {
659
+ "epoch": 0.16,
660
+ "grad_norm": 2.3944199085235596,
661
+ "learning_rate": 9.770766503525754e-05,
662
+ "loss": 0.7932,
663
  "step": 450
664
  },
665
  {
666
+ "epoch": 0.16177777777777777,
667
+ "grad_norm": 2.112563371658325,
668
+ "learning_rate": 9.759756926510965e-05,
669
+ "loss": 0.7873,
670
  "step": 455
671
  },
672
  {
673
+ "epoch": 0.16355555555555557,
674
+ "grad_norm": 2.041534185409546,
675
+ "learning_rate": 9.748495631009386e-05,
676
+ "loss": 0.796,
677
  "step": 460
678
  },
679
  {
680
+ "epoch": 0.16533333333333333,
681
+ "grad_norm": 1.7045772075653076,
682
+ "learning_rate": 9.736983212571646e-05,
683
+ "loss": 0.7791,
684
  "step": 465
685
  },
686
  {
687
+ "epoch": 0.1671111111111111,
688
+ "grad_norm": 1.5439887046813965,
689
+ "learning_rate": 9.725220280028957e-05,
690
+ "loss": 0.7939,
691
  "step": 470
692
  },
693
  {
694
+ "epoch": 0.1688888888888889,
695
+ "grad_norm": 1.459672451019287,
696
+ "learning_rate": 9.713207455460894e-05,
697
+ "loss": 0.7749,
698
  "step": 475
699
  },
700
  {
701
+ "epoch": 0.17066666666666666,
702
+ "grad_norm": 3.114187240600586,
703
+ "learning_rate": 9.700945374162506e-05,
704
+ "loss": 0.7785,
705
  "step": 480
706
  },
707
  {
708
+ "epoch": 0.17244444444444446,
709
+ "grad_norm": 1.7480342388153076,
710
+ "learning_rate": 9.688434684610726e-05,
711
+ "loss": 0.7653,
712
  "step": 485
713
  },
714
  {
715
+ "epoch": 0.17422222222222222,
716
+ "grad_norm": 1.854999303817749,
717
+ "learning_rate": 9.67567604843006e-05,
718
+ "loss": 0.7878,
719
  "step": 490
720
  },
721
  {
722
+ "epoch": 0.176,
723
+ "grad_norm": 2.006537437438965,
724
+ "learning_rate": 9.662670140357611e-05,
725
+ "loss": 0.7851,
726
  "step": 495
727
  },
728
  {
729
+ "epoch": 0.17777777777777778,
730
+ "grad_norm": 1.9404226541519165,
731
+ "learning_rate": 9.649417648207388e-05,
732
+ "loss": 0.7719,
 
 
 
 
 
 
 
 
  "step": 500
  },
  {
+ "epoch": 0.17955555555555555,
+ "grad_norm": 1.7404245138168335,
+ "learning_rate": 9.635919272833938e-05,
+ "loss": 0.775,
  "step": 505
  },
  {
+ "epoch": 0.18133333333333335,
+ "grad_norm": 1.4806632995605469,
+ "learning_rate": 9.622175728095271e-05,
+ "loss": 0.7822,
  "step": 510
  },
  {
+ "epoch": 0.1831111111111111,
+ "grad_norm": 1.607060432434082,
+ "learning_rate": 9.60818774081512e-05,
+ "loss": 0.7822,
  "step": 515
  },
  {
+ "epoch": 0.18488888888888888,
+ "grad_norm": 1.6430386304855347,
+ "learning_rate": 9.593956050744492e-05,
+ "loss": 0.7711,
  "step": 520
  },
  {
+ "epoch": 0.18666666666666668,
+ "grad_norm": 2.3202788829803467,
+ "learning_rate": 9.579481410522556e-05,
+ "loss": 0.7839,
  "step": 525
  },
  {
+ "epoch": 0.18844444444444444,
+ "grad_norm": 2.160609722137451,
+ "learning_rate": 9.564764585636833e-05,
+ "loss": 0.7854,
  "step": 530
  },
  {
+ "epoch": 0.1902222222222222,
+ "grad_norm": 1.7440357208251953,
+ "learning_rate": 9.549806354382717e-05,
+ "loss": 0.7806,
  "step": 535
  },
  {
+ "epoch": 0.192,
+ "grad_norm": 1.8481121063232422,
+ "learning_rate": 9.534607507822313e-05,
+ "loss": 0.7701,
  "step": 540
  },
  {
+ "epoch": 0.19377777777777777,
+ "grad_norm": 1.9447892904281616,
+ "learning_rate": 9.519168849742604e-05,
+ "loss": 0.772,
  "step": 545
  },
  {
+ "epoch": 0.19555555555555557,
+ "grad_norm": 2.9007174968719482,
+ "learning_rate": 9.503491196612939e-05,
+ "loss": 0.7486,
  "step": 550
  },
  {
+ "epoch": 0.19733333333333333,
+ "grad_norm": 1.5981870889663696,
+ "learning_rate": 9.487575377541864e-05,
+ "loss": 0.7713,
  "step": 555
  },
  {
+ "epoch": 0.1991111111111111,
+ "grad_norm": 1.6032360792160034,
+ "learning_rate": 9.471422234233259e-05,
+ "loss": 0.7596,
  "step": 560
  },
  {
+ "epoch": 0.2008888888888889,
+ "grad_norm": 1.7337145805358887,
+ "learning_rate": 9.45503262094184e-05,
+ "loss": 0.7725,
  "step": 565
  },
  {
+ "epoch": 0.20266666666666666,
+ "grad_norm": 1.7922943830490112,
+ "learning_rate": 9.438407404427971e-05,
+ "loss": 0.7646,
  "step": 570
  },
  {
+ "epoch": 0.20444444444444446,
+ "grad_norm": 1.2763404846191406,
+ "learning_rate": 9.421547463911835e-05,
+ "loss": 0.7744,
  "step": 575
  },
  {
+ "epoch": 0.20622222222222222,
+ "grad_norm": 1.6107685565948486,
+ "learning_rate": 9.404453691026929e-05,
+ "loss": 0.7854,
  "step": 580
  },
  {
+ "epoch": 0.208,
+ "grad_norm": 1.531690239906311,
+ "learning_rate": 9.38712698977291e-05,
+ "loss": 0.7765,
  "step": 585
  },
  {
+ "epoch": 0.20977777777777779,
+ "grad_norm": 4.303262710571289,
+ "learning_rate": 9.369568276467797e-05,
+ "loss": 0.7451,
  "step": 590
  },
  {
+ "epoch": 0.21155555555555555,
+ "grad_norm": 1.4451102018356323,
+ "learning_rate": 9.351778479699499e-05,
+ "loss": 0.767,
  "step": 595
  },
  {
+ "epoch": 0.21333333333333335,
+ "grad_norm": 1.9426709413528442,
+ "learning_rate": 9.333758540276716e-05,
+ "loss": 0.7611,
  "step": 600
  },
  {
+ "epoch": 0.21333333333333335,
+ "eval_loss": 0.7643172144889832,
+ "eval_runtime": 149.0272,
+ "eval_samples_per_second": 33.551,
+ "eval_steps_per_second": 4.194,
  "step": 600
  },
  {
+ "epoch": 0.21511111111111111,
+ "grad_norm": 1.740432858467102,
+ "learning_rate": 9.315509411179182e-05,
+ "loss": 0.763,
  "step": 605
  },
  {
+ "epoch": 0.21688888888888888,
+ "grad_norm": 2.1427745819091797,
+ "learning_rate": 9.297032057507264e-05,
+ "loss": 0.7717,
  "step": 610
  },
  {
+ "epoch": 0.21866666666666668,
+ "grad_norm": 2.2643210887908936,
+ "learning_rate": 9.278327456430926e-05,
+ "loss": 0.7917,
  "step": 615
  },
  {
+ "epoch": 0.22044444444444444,
+ "grad_norm": 1.9076204299926758,
+ "learning_rate": 9.259396597138052e-05,
+ "loss": 0.7637,
  "step": 620
  },
  {
+ "epoch": 0.2222222222222222,
+ "grad_norm": 1.2893517017364502,
+ "learning_rate": 9.24024048078213e-05,
+ "loss": 0.7435,
  "step": 625
  },
  {
+ "epoch": 0.224,
+ "grad_norm": 1.9294421672821045,
+ "learning_rate": 9.22086012042931e-05,
+ "loss": 0.7446,
  "step": 630
  },
  {
+ "epoch": 0.22577777777777777,
+ "grad_norm": 1.516155481338501,
+ "learning_rate": 9.201256541004829e-05,
+ "loss": 0.7608,
  "step": 635
  },
  {
+ "epoch": 0.22755555555555557,
+ "grad_norm": 1.4733030796051025,
+ "learning_rate": 9.181430779238797e-05,
+ "loss": 0.7708,
  "step": 640
  },
  {
+ "epoch": 0.22933333333333333,
+ "grad_norm": 1.7201124429702759,
+ "learning_rate": 9.16138388361139e-05,
+ "loss": 0.7475,
  "step": 645
  },
  {
+ "epoch": 0.2311111111111111,
+ "grad_norm": 1.573810338973999,
+ "learning_rate": 9.141116914297378e-05,
+ "loss": 0.7782,
  "step": 650
  },
  {
+ "epoch": 0.2328888888888889,
+ "grad_norm": 1.3574259281158447,
+ "learning_rate": 9.120630943110077e-05,
+ "loss": 0.7406,
  "step": 655
  },
  {
+ "epoch": 0.23466666666666666,
+ "grad_norm": 1.912653923034668,
+ "learning_rate": 9.099927053444662e-05,
+ "loss": 0.7462,
  "step": 660
  },
  {
+ "epoch": 0.23644444444444446,
+ "grad_norm": 2.2716944217681885,
+ "learning_rate": 9.079006340220862e-05,
+ "loss": 0.7526,
  "step": 665
  },
  {
+ "epoch": 0.23822222222222222,
+ "grad_norm": 1.625752329826355,
+ "learning_rate": 9.057869909825062e-05,
+ "loss": 0.762,
  "step": 670
  },
  {
+ "epoch": 0.24,
+ "grad_norm": 1.2941769361495972,
+ "learning_rate": 9.0365188800518e-05,
+ "loss": 0.7544,
  "step": 675
  },
  {
+ "epoch": 0.24177777777777779,
+ "grad_norm": 1.6075447797775269,
+ "learning_rate": 9.01495438004464e-05,
+ "loss": 0.7639,
  "step": 680
  },
  {
+ "epoch": 0.24355555555555555,
+ "grad_norm": 1.671107292175293,
+ "learning_rate": 8.993177550236464e-05,
+ "loss": 0.7567,
  "step": 685
  },
  {
+ "epoch": 0.24533333333333332,
+ "grad_norm": 1.3904489278793335,
+ "learning_rate": 8.971189542289162e-05,
+ "loss": 0.7633,
  "step": 690
  },
  {
+ "epoch": 0.24711111111111111,
+ "grad_norm": 1.997207760810852,
+ "learning_rate": 8.948991519032716e-05,
+ "loss": 0.7403,
  "step": 695
  },
  {
+ "epoch": 0.24888888888888888,
+ "grad_norm": 1.5448390245437622,
+ "learning_rate": 8.926584654403724e-05,
+ "loss": 0.7424,
  "step": 700
  },
  {
+ "epoch": 0.25066666666666665,
+ "grad_norm": 1.760542392730713,
+ "learning_rate": 8.903970133383297e-05,
+ "loss": 0.7436,
  "step": 705
  },
  {
+ "epoch": 0.25244444444444447,
+ "grad_norm": 1.6473764181137085,
+ "learning_rate": 8.881149151934398e-05,
+ "loss": 0.7569,
  "step": 710
  },
  {
+ "epoch": 0.25422222222222224,
+ "grad_norm": 1.2679284811019897,
+ "learning_rate": 8.858122916938601e-05,
+ "loss": 0.7556,
  "step": 715
  },
  {
+ "epoch": 0.256,
+ "grad_norm": 1.3798352479934692,
+ "learning_rate": 8.834892646132254e-05,
+ "loss": 0.7446,
  "step": 720
  },
  {
+ "epoch": 0.2577777777777778,
+ "grad_norm": 1.692984700202942,
+ "learning_rate": 8.811459568042091e-05,
+ "loss": 0.7695,
  "step": 725
  },
  {
+ "epoch": 0.25955555555555554,
+ "grad_norm": 1.6680200099945068,
+ "learning_rate": 8.787824921920249e-05,
+ "loss": 0.7462,
  "step": 730
  },
  {
+ "epoch": 0.2613333333333333,
+ "grad_norm": 1.4423996210098267,
+ "learning_rate": 8.763989957678742e-05,
+ "loss": 0.7637,
  "step": 735
  },
  {
+ "epoch": 0.26311111111111113,
+ "grad_norm": 1.6775215864181519,
+ "learning_rate": 8.739955935823351e-05,
+ "loss": 0.755,
  "step": 740
  },
  {
+ "epoch": 0.2648888888888889,
+ "grad_norm": 1.6499780416488647,
+ "learning_rate": 8.715724127386972e-05,
+ "loss": 0.7468,
  "step": 745
  },
  {
+ "epoch": 0.26666666666666666,
+ "grad_norm": 1.3906657695770264,
+ "learning_rate": 8.691295813862386e-05,
+ "loss": 0.7458,
  "step": 750
  },
  {
+ "epoch": 0.26844444444444443,
+ "grad_norm": 1.3843157291412354,
+ "learning_rate": 8.666672287134494e-05,
+ "loss": 0.7461,
  "step": 755
  },
  {
+ "epoch": 0.2702222222222222,
+ "grad_norm": 1.6778830289840698,
+ "learning_rate": 8.641854849412001e-05,
+ "loss": 0.7284,
  "step": 760
  },
  {
+ "epoch": 0.272,
+ "grad_norm": 1.2424935102462769,
+ "learning_rate": 8.61684481315854e-05,
+ "loss": 0.7774,
  "step": 765
  },
  {
+ "epoch": 0.2737777777777778,
+ "grad_norm": 1.291231393814087,
+ "learning_rate": 8.591643501023265e-05,
+ "loss": 0.7428,
  "step": 770
  },
  {
+ "epoch": 0.27555555555555555,
+ "grad_norm": 1.12603759765625,
+ "learning_rate": 8.566252245770909e-05,
+ "loss": 0.7377,
  "step": 775
  },
  {
+ "epoch": 0.2773333333333333,
+ "grad_norm": 1.348775029182434,
+ "learning_rate": 8.54067239021129e-05,
+ "loss": 0.762,
  "step": 780
  },
  {
+ "epoch": 0.2791111111111111,
+ "grad_norm": 1.703519582748413,
+ "learning_rate": 8.51490528712831e-05,
+ "loss": 0.7533,
  "step": 785
  },
  {
+ "epoch": 0.2808888888888889,
+ "grad_norm": 1.8479610681533813,
+ "learning_rate": 8.488952299208401e-05,
+ "loss": 0.7535,
  "step": 790
  },
  {
+ "epoch": 0.2826666666666667,
+ "grad_norm": 1.1511187553405762,
+ "learning_rate": 8.462814798968472e-05,
+ "loss": 0.7555,
  "step": 795
  },
  {
+ "epoch": 0.28444444444444444,
+ "grad_norm": 2.5522501468658447,
+ "learning_rate": 8.43649416868331e-05,
+ "loss": 0.741,
  "step": 800
  },
  {
+ "epoch": 0.28444444444444444,
+ "eval_loss": 0.7418414950370789,
+ "eval_runtime": 169.2643,
+ "eval_samples_per_second": 29.54,
+ "eval_steps_per_second": 3.692,
  "step": 800
  }
  ],
  "logging_steps": 5,
+ "max_steps": 2400,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 1,
+ "save_steps": 200,
  "stateful_callbacks": {
  "TrainerControl": {
  "args": {

  "should_evaluate": false,
  "should_log": false,
  "should_save": true,
+ "should_training_stop": false
  },
  "attributes": {}
  }
  },
+ "total_flos": 775521753600000.0,
+ "train_batch_size": 4,
  "trial_name": null,
  "trial_params": null
  }
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1f301dabc963ac8802177bfd738213a0cc9f22b48633c155ad395ae77124e6c7
- size 5368
+ oid sha256:e74b0f4621a0feddce935bce6008dfc021ab4f3f6753b47ccab1c7dbe33fc776
+ size 5841