RayDu0010 committed
Commit 85207e4 · verified · 1 Parent(s): ede7412

Upload folder using huggingface_hub

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full changeset.
Files changed (50)
  1. 13_128_e5_3e-5/checkpoint-1256/README.md +202 -0
  2. 13_128_e5_3e-5/checkpoint-1256/adapter_config.json +39 -0
  3. 13_128_e5_3e-5/checkpoint-1256/adapter_model.safetensors +3 -0
  4. 13_128_e5_3e-5/checkpoint-1256/latest +1 -0
  5. 13_128_e5_3e-5/checkpoint-1256/merges.txt +0 -0
  6. 13_128_e5_3e-5/checkpoint-1256/rng_state_0.pth +3 -0
  7. 13_128_e5_3e-5/checkpoint-1256/rng_state_1.pth +3 -0
  8. 13_128_e5_3e-5/checkpoint-1256/rng_state_2.pth +3 -0
  9. 13_128_e5_3e-5/checkpoint-1256/rng_state_3.pth +3 -0
  10. 13_128_e5_3e-5/checkpoint-1256/rng_state_4.pth +3 -0
  11. 13_128_e5_3e-5/checkpoint-1256/rng_state_5.pth +3 -0
  12. 13_128_e5_3e-5/checkpoint-1256/rng_state_6.pth +3 -0
  13. 13_128_e5_3e-5/checkpoint-1256/rng_state_7.pth +3 -0
  14. 13_128_e5_3e-5/checkpoint-1256/scheduler.pt +3 -0
  15. 13_128_e5_3e-5/checkpoint-1256/special_tokens_map.json +45 -0
  16. 13_128_e5_3e-5/checkpoint-1256/tokenizer.json +0 -0
  17. 13_128_e5_3e-5/checkpoint-1256/tokenizer_config.json +188 -0
  18. 13_128_e5_3e-5/checkpoint-1256/trainer_state.json +1791 -0
  19. 13_128_e5_3e-5/checkpoint-1256/training_args.bin +3 -0
  20. 13_128_e5_3e-5/checkpoint-1256/vocab.json +0 -0
  21. 13_128_e5_3e-5/checkpoint-1256/zero_to_fp32.py +604 -0
  22. 13_128_e5_3e-5/checkpoint-1570/README.md +202 -0
  23. 13_128_e5_3e-5/checkpoint-1570/adapter_config.json +39 -0
  24. 13_128_e5_3e-5/checkpoint-1570/adapter_model.safetensors +3 -0
  25. 13_128_e5_3e-5/checkpoint-1570/latest +1 -0
  26. 13_128_e5_3e-5/checkpoint-1570/merges.txt +0 -0
  27. 13_128_e5_3e-5/checkpoint-1570/rng_state_0.pth +3 -0
  28. 13_128_e5_3e-5/checkpoint-1570/rng_state_1.pth +3 -0
  29. 13_128_e5_3e-5/checkpoint-1570/rng_state_2.pth +3 -0
  30. 13_128_e5_3e-5/checkpoint-1570/rng_state_3.pth +3 -0
  31. 13_128_e5_3e-5/checkpoint-1570/rng_state_4.pth +3 -0
  32. 13_128_e5_3e-5/checkpoint-1570/rng_state_5.pth +3 -0
  33. 13_128_e5_3e-5/checkpoint-1570/rng_state_6.pth +3 -0
  34. 13_128_e5_3e-5/checkpoint-1570/rng_state_7.pth +3 -0
  35. 13_128_e5_3e-5/checkpoint-1570/scheduler.pt +3 -0
  36. 13_128_e5_3e-5/checkpoint-1570/special_tokens_map.json +45 -0
  37. 13_128_e5_3e-5/checkpoint-1570/tokenizer.json +0 -0
  38. 13_128_e5_3e-5/checkpoint-1570/tokenizer_config.json +188 -0
  39. 13_128_e5_3e-5/checkpoint-1570/trainer_state.json +2232 -0
  40. 13_128_e5_3e-5/checkpoint-1570/training_args.bin +3 -0
  41. 13_128_e5_3e-5/checkpoint-1570/vocab.json +0 -0
  42. 13_128_e5_3e-5/checkpoint-1570/zero_to_fp32.py +604 -0
  43. 13_128_e5_3e-5/checkpoint-314/README.md +202 -0
  44. 13_128_e5_3e-5/checkpoint-314/adapter_config.json +39 -0
  45. 13_128_e5_3e-5/checkpoint-314/adapter_model.safetensors +3 -0
  46. 13_128_e5_3e-5/checkpoint-314/latest +1 -0
  47. 13_128_e5_3e-5/checkpoint-314/merges.txt +0 -0
  48. 13_128_e5_3e-5/checkpoint-314/rng_state_0.pth +3 -0
  49. 13_128_e5_3e-5/checkpoint-314/rng_state_1.pth +3 -0
  50. 13_128_e5_3e-5/checkpoint-314/rng_state_2.pth +3 -0
13_128_e5_3e-5/checkpoint-1256/README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: ibm-granite/granite-3.3-8b-base
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.2
13_128_e5_3e-5/checkpoint-1256/adapter_config.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": true,
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 256,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 128,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "up_proj",
+ "down_proj",
+ "q_proj",
+ "v_proj",
+ "o_proj",
+ "gate_proj",
+ "k_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
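The adapter_config.json above describes a rank-128 LoRA adapter (lora_alpha 256, dropout 0.05) over the attention and MLP projections of ibm-granite/granite-3.3-8b-base. As a hedged sketch only — no such snippet ships in this commit, and the dtype/device settings are assumptions — an adapter checkpoint like this one is typically re-attached to the base model with PEFT roughly as follows:

```python
# Sketch: load the LoRA adapter saved in this checkpoint folder on top of the
# base model named in adapter_config.json (dtype/device below are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

checkpoint_dir = "13_128_e5_3e-5/checkpoint-1256"

base = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-base",  # base_model_name_or_path from the config
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, checkpoint_dir)  # applies the r=128 LoRA weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
```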
13_128_e5_3e-5/checkpoint-1256/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f69f1251a6205e7646564fd6c7daa64e137e9697d6cd74900bb0a7a9173a5232
+ size 791751704
13_128_e5_3e-5/checkpoint-1256/latest ADDED
@@ -0,0 +1 @@
+ global_step1256
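The `latest` file is the DeepSpeed checkpoint tag; together with the bundled zero_to_fp32.py listed in the file list, it marks this folder as a ZeRO-sharded trainer checkpoint. A hedged sketch of how such shards are usually consolidated into a single fp32 state dict (the utility is the standard DeepSpeed helper, but the exact call depends on the installed DeepSpeed version):

```python
# Sketch: consolidate the ZeRO shards referenced by the "global_step1256" tag
# into one fp32 state dict using DeepSpeed's bundled utility (version-dependent).
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "13_128_e5_3e-5/checkpoint-1256"
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag="global_step1256")
print(f"recovered {len(state_dict)} consolidated fp32 tensors")
```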
13_128_e5_3e-5/checkpoint-1256/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1256/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:413b481583a59b743511fd1d84944abefc89887a7da15df755048d4f18611641
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8beea103599b16c41e7b57a89c41138688176e04a09a1985576c70fd16151c6
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2933f3f002c279da8c3ef4db1155dbbee003d99ade851854f574e9ad964f2ab2
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:25996721132d97075cb3e3780d494af34aed744206a630113d48d84bf645d4b2
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63e79c7204f2ceaffa6475f10e50b9431e9d0ec3ce891b13f72b72e5946176f8
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74adbcfb9bae4dcdb62c59a062b6799d9d398e4bd805434309d903675661fcc1
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1dcd90c7a7fcb6a0e97ab4ffa23a85718bcb9a720e59408c3dc5136fb97c694
+ size 15920
13_128_e5_3e-5/checkpoint-1256/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:30748ffefdb407c2f9fd8e21f8cf5a6ed8bc8c3a8378e41947547b6ded55ac0e
+ size 15920
13_128_e5_3e-5/checkpoint-1256/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:42ef90eb11168bd05ef1bc84cec3fb2e52a6c4b99567398d6b03c18bc0cea696
+ size 1064
13_128_e5_3e-5/checkpoint-1256/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
+ {
+ "additional_special_tokens": [
+ "<|endoftext|>",
+ "<fim_prefix>",
+ "<fim_middle>",
+ "<fim_suffix>",
+ "<fim_pad>",
+ "<filename>",
+ "<gh_stars>",
+ "<issue_start>",
+ "<issue_comment>",
+ "<issue_closed>",
+ "<jupyter_start>",
+ "<jupyter_text>",
+ "<jupyter_code>",
+ "<jupyter_output>",
+ "<empty_output>",
+ "<commit_before>",
+ "<commit_msg>",
+ "<commit_after>",
+ "<reponame>"
+ ],
+ "bos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<reponame>",
+ "unk_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
13_128_e5_3e-5/checkpoint-1256/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1256/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<fim_prefix>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<fim_middle>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<fim_suffix>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<fim_pad>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<filename>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<gh_stars>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<issue_start>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_comment>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_closed>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<jupyter_start>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_text>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_code>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_output>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<empty_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<commit_before>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<commit_msg>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "17": {
141
+ "content": "<commit_after>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "18": {
149
+ "content": "<reponame>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ }
156
+ },
157
+ "additional_special_tokens": [
158
+ "<|endoftext|>",
159
+ "<fim_prefix>",
160
+ "<fim_middle>",
161
+ "<fim_suffix>",
162
+ "<fim_pad>",
163
+ "<filename>",
164
+ "<gh_stars>",
165
+ "<issue_start>",
166
+ "<issue_comment>",
167
+ "<issue_closed>",
168
+ "<jupyter_start>",
169
+ "<jupyter_text>",
170
+ "<jupyter_code>",
171
+ "<jupyter_output>",
172
+ "<empty_output>",
173
+ "<commit_before>",
174
+ "<commit_msg>",
175
+ "<commit_after>",
176
+ "<reponame>"
177
+ ],
178
+ "bos_token": "<|endoftext|>",
179
+ "clean_up_tokenization_spaces": true,
180
+ "eos_token": "<|endoftext|>",
181
+ "extra_special_tokens": {},
182
+ "model_max_length": 8192,
183
+ "pad_token": "<reponame>",
184
+ "padding_side": "left",
185
+ "tokenizer_class": "GPT2Tokenizer",
186
+ "unk_token": "<|endoftext|>",
187
+ "vocab_size": 49152
188
+ }
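tokenizer_config.json above pins the checkpoint to a GPT2-style BPE tokenizer (vocab_size 49152, model_max_length 8192, left-side padding, <reponame> reused as pad token). As an illustrative sketch that is not part of the commit, the tokenizer can be restored straight from the checkpoint folder:

```python
# Sketch: reload the tokenizer shipped with this checkpoint; inputs are
# left-padded, matching padding_side in tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("13_128_e5_3e-5/checkpoint-1256")
batch = tokenizer(["def add(a, b):", "return a + b"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # padded to the longest sequence in the batch
```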
13_128_e5_3e-5/checkpoint-1256/trainer_state.json ADDED
@@ -0,0 +1,1791 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 4.0,
6
+ "eval_steps": 500,
7
+ "global_step": 1256,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.01594896331738437,
14
+ "grad_norm": 1.6145861148834229,
15
+ "learning_rate": 1.518987341772152e-06,
16
+ "loss": 1.43,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.03189792663476874,
21
+ "grad_norm": 1.0209437608718872,
22
+ "learning_rate": 3.4177215189873417e-06,
23
+ "loss": 1.3954,
24
+ "step": 10
25
+ },
26
+ {
27
+ "epoch": 0.04784688995215311,
28
+ "grad_norm": 0.7023665308952332,
29
+ "learning_rate": 5.3164556962025316e-06,
30
+ "loss": 1.412,
31
+ "step": 15
32
+ },
33
+ {
34
+ "epoch": 0.06379585326953748,
35
+ "grad_norm": 0.6941770911216736,
36
+ "learning_rate": 7.215189873417722e-06,
37
+ "loss": 1.3573,
38
+ "step": 20
39
+ },
40
+ {
41
+ "epoch": 0.07974481658692185,
42
+ "grad_norm": 0.5891399383544922,
43
+ "learning_rate": 9.113924050632912e-06,
44
+ "loss": 1.3281,
45
+ "step": 25
46
+ },
47
+ {
48
+ "epoch": 0.09569377990430622,
49
+ "grad_norm": 0.6663492321968079,
50
+ "learning_rate": 1.1012658227848103e-05,
51
+ "loss": 1.3214,
52
+ "step": 30
53
+ },
54
+ {
55
+ "epoch": 0.11164274322169059,
56
+ "grad_norm": 0.4784772992134094,
57
+ "learning_rate": 1.2911392405063291e-05,
58
+ "loss": 1.2544,
59
+ "step": 35
60
+ },
61
+ {
62
+ "epoch": 0.12759170653907495,
63
+ "grad_norm": 0.5382697582244873,
64
+ "learning_rate": 1.4810126582278482e-05,
65
+ "loss": 1.2353,
66
+ "step": 40
67
+ },
68
+ {
69
+ "epoch": 0.14354066985645933,
70
+ "grad_norm": 0.47352251410484314,
71
+ "learning_rate": 1.670886075949367e-05,
72
+ "loss": 1.2898,
73
+ "step": 45
74
+ },
75
+ {
76
+ "epoch": 0.1594896331738437,
77
+ "grad_norm": 0.4318561851978302,
78
+ "learning_rate": 1.860759493670886e-05,
79
+ "loss": 1.2872,
80
+ "step": 50
81
+ },
82
+ {
83
+ "epoch": 0.17543859649122806,
84
+ "grad_norm": 0.4748729467391968,
85
+ "learning_rate": 2.050632911392405e-05,
86
+ "loss": 1.2466,
87
+ "step": 55
88
+ },
89
+ {
90
+ "epoch": 0.19138755980861244,
91
+ "grad_norm": 0.4512109160423279,
92
+ "learning_rate": 2.240506329113924e-05,
93
+ "loss": 1.2634,
94
+ "step": 60
95
+ },
96
+ {
97
+ "epoch": 0.20733652312599682,
98
+ "grad_norm": 0.4634909927845001,
99
+ "learning_rate": 2.430379746835443e-05,
100
+ "loss": 1.2455,
101
+ "step": 65
102
+ },
103
+ {
104
+ "epoch": 0.22328548644338117,
105
+ "grad_norm": 0.5010442733764648,
106
+ "learning_rate": 2.620253164556962e-05,
107
+ "loss": 1.192,
108
+ "step": 70
109
+ },
110
+ {
111
+ "epoch": 0.23923444976076555,
112
+ "grad_norm": 0.581072211265564,
113
+ "learning_rate": 2.8101265822784812e-05,
114
+ "loss": 1.1886,
115
+ "step": 75
116
+ },
117
+ {
118
+ "epoch": 0.2551834130781499,
119
+ "grad_norm": 0.4782950282096863,
120
+ "learning_rate": 3e-05,
121
+ "loss": 1.1987,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.2711323763955343,
126
+ "grad_norm": 0.583503782749176,
127
+ "learning_rate": 2.999916758151899e-05,
128
+ "loss": 1.1802,
129
+ "step": 85
130
+ },
131
+ {
132
+ "epoch": 0.28708133971291866,
133
+ "grad_norm": 0.48859143257141113,
134
+ "learning_rate": 2.999667041846535e-05,
135
+ "loss": 1.1947,
136
+ "step": 90
137
+ },
138
+ {
139
+ "epoch": 0.30303030303030304,
140
+ "grad_norm": 0.6850916743278503,
141
+ "learning_rate": 2.9992508787997044e-05,
142
+ "loss": 1.1564,
143
+ "step": 95
144
+ },
145
+ {
146
+ "epoch": 0.3189792663476874,
147
+ "grad_norm": 0.5491988658905029,
148
+ "learning_rate": 2.9986683152009822e-05,
149
+ "loss": 1.1004,
150
+ "step": 100
151
+ },
152
+ {
153
+ "epoch": 0.3349282296650718,
154
+ "grad_norm": 0.7356356382369995,
155
+ "learning_rate": 2.997919415708596e-05,
156
+ "loss": 1.0898,
157
+ "step": 105
158
+ },
159
+ {
160
+ "epoch": 0.3508771929824561,
161
+ "grad_norm": 0.5359194874763489,
162
+ "learning_rate": 2.9970042634422484e-05,
163
+ "loss": 1.1407,
164
+ "step": 110
165
+ },
166
+ {
167
+ "epoch": 0.3668261562998405,
168
+ "grad_norm": 0.7027154564857483,
169
+ "learning_rate": 2.995922959973895e-05,
170
+ "loss": 1.0929,
171
+ "step": 115
172
+ },
173
+ {
174
+ "epoch": 0.3827751196172249,
175
+ "grad_norm": 0.6330419182777405,
176
+ "learning_rate": 2.994675625316468e-05,
177
+ "loss": 1.0657,
178
+ "step": 120
179
+ },
180
+ {
181
+ "epoch": 0.39872408293460926,
182
+ "grad_norm": 0.6170833110809326,
183
+ "learning_rate": 2.9932623979105558e-05,
184
+ "loss": 1.0718,
185
+ "step": 125
186
+ },
187
+ {
188
+ "epoch": 0.41467304625199364,
189
+ "grad_norm": 0.6576064229011536,
190
+ "learning_rate": 2.9916834346090406e-05,
191
+ "loss": 1.0926,
192
+ "step": 130
193
+ },
194
+ {
195
+ "epoch": 0.430622009569378,
196
+ "grad_norm": 0.5628703832626343,
197
+ "learning_rate": 2.9899389106596867e-05,
198
+ "loss": 0.9873,
199
+ "step": 135
200
+ },
201
+ {
202
+ "epoch": 0.44657097288676234,
203
+ "grad_norm": 0.6521088480949402,
204
+ "learning_rate": 2.9880290196856913e-05,
205
+ "loss": 1.0647,
206
+ "step": 140
207
+ },
208
+ {
209
+ "epoch": 0.4625199362041467,
210
+ "grad_norm": 0.7059735059738159,
211
+ "learning_rate": 2.9859539736641926e-05,
212
+ "loss": 1.0348,
213
+ "step": 145
214
+ },
215
+ {
216
+ "epoch": 0.4784688995215311,
217
+ "grad_norm": 0.6659881472587585,
218
+ "learning_rate": 2.983714002902746e-05,
219
+ "loss": 1.035,
220
+ "step": 150
221
+ },
222
+ {
223
+ "epoch": 0.4944178628389155,
224
+ "grad_norm": 0.6957710981369019,
225
+ "learning_rate": 2.9813093560137577e-05,
226
+ "loss": 1.0298,
227
+ "step": 155
228
+ },
229
+ {
230
+ "epoch": 0.5103668261562998,
231
+ "grad_norm": 0.7887131571769714,
232
+ "learning_rate": 2.978740299886898e-05,
233
+ "loss": 0.9958,
234
+ "step": 160
235
+ },
236
+ {
237
+ "epoch": 0.5263157894736842,
238
+ "grad_norm": 0.8920831084251404,
239
+ "learning_rate": 2.9760071196594715e-05,
240
+ "loss": 1.0141,
241
+ "step": 165
242
+ },
243
+ {
244
+ "epoch": 0.5422647527910686,
245
+ "grad_norm": 0.9099592566490173,
246
+ "learning_rate": 2.973110118684777e-05,
247
+ "loss": 0.9644,
248
+ "step": 170
249
+ },
250
+ {
251
+ "epoch": 0.5582137161084529,
252
+ "grad_norm": 0.9702492356300354,
253
+ "learning_rate": 2.970049618498434e-05,
254
+ "loss": 0.9158,
255
+ "step": 175
256
+ },
257
+ {
258
+ "epoch": 0.5741626794258373,
259
+ "grad_norm": 0.8628178834915161,
260
+ "learning_rate": 2.9668259587826984e-05,
261
+ "loss": 0.934,
262
+ "step": 180
263
+ },
264
+ {
265
+ "epoch": 0.5901116427432217,
266
+ "grad_norm": 0.7894458174705505,
267
+ "learning_rate": 2.9634394973287605e-05,
268
+ "loss": 0.9344,
269
+ "step": 185
270
+ },
271
+ {
272
+ "epoch": 0.6060606060606061,
273
+ "grad_norm": 0.7742944955825806,
274
+ "learning_rate": 2.9598906099970324e-05,
275
+ "loss": 0.9333,
276
+ "step": 190
277
+ },
278
+ {
279
+ "epoch": 0.6220095693779905,
280
+ "grad_norm": 0.8715435266494751,
281
+ "learning_rate": 2.956179690675435e-05,
282
+ "loss": 0.8983,
283
+ "step": 195
284
+ },
285
+ {
286
+ "epoch": 0.6379585326953748,
287
+ "grad_norm": 0.9087679386138916,
288
+ "learning_rate": 2.9523071512356785e-05,
289
+ "loss": 0.8283,
290
+ "step": 200
291
+ },
292
+ {
293
+ "epoch": 0.6539074960127592,
294
+ "grad_norm": 0.8821738958358765,
295
+ "learning_rate": 2.9482734214875492e-05,
296
+ "loss": 0.9741,
297
+ "step": 205
298
+ },
299
+ {
300
+ "epoch": 0.6698564593301436,
301
+ "grad_norm": 0.8856318593025208,
302
+ "learning_rate": 2.9440789491312053e-05,
303
+ "loss": 0.8902,
304
+ "step": 210
305
+ },
306
+ {
307
+ "epoch": 0.6858054226475279,
308
+ "grad_norm": 0.9013231992721558,
309
+ "learning_rate": 2.9397241997074885e-05,
310
+ "loss": 0.9095,
311
+ "step": 215
312
+ },
313
+ {
314
+ "epoch": 0.7017543859649122,
315
+ "grad_norm": 0.9037012457847595,
316
+ "learning_rate": 2.9352096565462518e-05,
317
+ "loss": 0.8663,
318
+ "step": 220
319
+ },
320
+ {
321
+ "epoch": 0.7177033492822966,
322
+ "grad_norm": 0.9097235798835754,
323
+ "learning_rate": 2.9305358207127163e-05,
324
+ "loss": 0.8667,
325
+ "step": 225
326
+ },
327
+ {
328
+ "epoch": 0.733652312599681,
329
+ "grad_norm": 0.8799058198928833,
330
+ "learning_rate": 2.9257032109518594e-05,
331
+ "loss": 0.8142,
332
+ "step": 230
333
+ },
334
+ {
335
+ "epoch": 0.7496012759170654,
336
+ "grad_norm": 1.0260847806930542,
337
+ "learning_rate": 2.9207123636308372e-05,
338
+ "loss": 0.8293,
339
+ "step": 235
340
+ },
341
+ {
342
+ "epoch": 0.7655502392344498,
343
+ "grad_norm": 0.9114710688591003,
344
+ "learning_rate": 2.9155638326794564e-05,
345
+ "loss": 0.8146,
346
+ "step": 240
347
+ },
348
+ {
349
+ "epoch": 0.7814992025518341,
350
+ "grad_norm": 0.9736585021018982,
351
+ "learning_rate": 2.9102581895286923e-05,
352
+ "loss": 0.8152,
353
+ "step": 245
354
+ },
355
+ {
356
+ "epoch": 0.7974481658692185,
357
+ "grad_norm": 1.001495361328125,
358
+ "learning_rate": 2.9047960230472655e-05,
359
+ "loss": 0.869,
360
+ "step": 250
361
+ },
362
+ {
363
+ "epoch": 0.8133971291866029,
364
+ "grad_norm": 1.0725624561309814,
365
+ "learning_rate": 2.8991779394762875e-05,
366
+ "loss": 0.8488,
367
+ "step": 255
368
+ },
369
+ {
370
+ "epoch": 0.8293460925039873,
371
+ "grad_norm": 1.0831763744354248,
372
+ "learning_rate": 2.8934045623619697e-05,
373
+ "loss": 0.809,
374
+ "step": 260
375
+ },
376
+ {
377
+ "epoch": 0.8452950558213717,
378
+ "grad_norm": 1.060348629951477,
379
+ "learning_rate": 2.88747653248642e-05,
380
+ "loss": 0.8019,
381
+ "step": 265
382
+ },
383
+ {
384
+ "epoch": 0.861244019138756,
385
+ "grad_norm": 1.001414179801941,
386
+ "learning_rate": 2.8813945077965217e-05,
387
+ "loss": 0.7483,
388
+ "step": 270
389
+ },
390
+ {
391
+ "epoch": 0.8771929824561403,
392
+ "grad_norm": 1.0342857837677002,
393
+ "learning_rate": 2.875159163330909e-05,
394
+ "loss": 0.7767,
395
+ "step": 275
396
+ },
397
+ {
398
+ "epoch": 0.8931419457735247,
399
+ "grad_norm": 1.079073190689087,
400
+ "learning_rate": 2.8687711911450436e-05,
401
+ "loss": 0.7291,
402
+ "step": 280
403
+ },
404
+ {
405
+ "epoch": 0.9090909090909091,
406
+ "grad_norm": 1.0348238945007324,
407
+ "learning_rate": 2.862231300234407e-05,
408
+ "loss": 0.7566,
409
+ "step": 285
410
+ },
411
+ {
412
+ "epoch": 0.9250398724082934,
413
+ "grad_norm": 0.9865140318870544,
414
+ "learning_rate": 2.8555402164558058e-05,
415
+ "loss": 0.7365,
416
+ "step": 290
417
+ },
418
+ {
419
+ "epoch": 0.9409888357256778,
420
+ "grad_norm": 0.9957141280174255,
421
+ "learning_rate": 2.8486986824468134e-05,
422
+ "loss": 0.7394,
423
+ "step": 295
424
+ },
425
+ {
426
+ "epoch": 0.9569377990430622,
427
+ "grad_norm": 1.2500948905944824,
428
+ "learning_rate": 2.841707457543343e-05,
429
+ "loss": 0.7223,
430
+ "step": 300
431
+ },
432
+ {
433
+ "epoch": 0.9728867623604466,
434
+ "grad_norm": 1.0007729530334473,
435
+ "learning_rate": 2.8345673176953692e-05,
436
+ "loss": 0.7563,
437
+ "step": 305
438
+ },
439
+ {
440
+ "epoch": 0.988835725677831,
441
+ "grad_norm": 1.1462596654891968,
442
+ "learning_rate": 2.8272790553808082e-05,
443
+ "loss": 0.7193,
444
+ "step": 310
445
+ },
446
+ {
447
+ "epoch": 1.003189792663477,
448
+ "grad_norm": 1.1253020763397217,
449
+ "learning_rate": 2.8198434795175585e-05,
450
+ "loss": 0.6737,
451
+ "step": 315
452
+ },
453
+ {
454
+ "epoch": 1.0191387559808613,
455
+ "grad_norm": 1.070001482963562,
456
+ "learning_rate": 2.8122614153737228e-05,
457
+ "loss": 0.6186,
458
+ "step": 320
459
+ },
460
+ {
461
+ "epoch": 1.0350877192982457,
462
+ "grad_norm": 1.606988549232483,
463
+ "learning_rate": 2.8045337044760103e-05,
464
+ "loss": 0.6471,
465
+ "step": 325
466
+ },
467
+ {
468
+ "epoch": 1.0510366826156299,
469
+ "grad_norm": 1.4467155933380127,
470
+ "learning_rate": 2.7966612045163363e-05,
471
+ "loss": 0.6399,
472
+ "step": 330
473
+ },
474
+ {
475
+ "epoch": 1.0669856459330143,
476
+ "grad_norm": 1.2260429859161377,
477
+ "learning_rate": 2.7886447892566284e-05,
478
+ "loss": 0.6327,
479
+ "step": 335
480
+ },
481
+ {
482
+ "epoch": 1.0829346092503986,
483
+ "grad_norm": 0.9376658797264099,
484
+ "learning_rate": 2.7804853484318488e-05,
485
+ "loss": 0.6429,
486
+ "step": 340
487
+ },
488
+ {
489
+ "epoch": 1.098883572567783,
490
+ "grad_norm": 1.349844217300415,
491
+ "learning_rate": 2.7721837876512425e-05,
492
+ "loss": 0.6507,
493
+ "step": 345
494
+ },
495
+ {
496
+ "epoch": 1.1148325358851674,
497
+ "grad_norm": 1.0897091627120972,
498
+ "learning_rate": 2.763741028297824e-05,
499
+ "loss": 0.6087,
500
+ "step": 350
501
+ },
502
+ {
503
+ "epoch": 1.1307814992025518,
504
+ "grad_norm": 1.189454436302185,
505
+ "learning_rate": 2.755158007426116e-05,
506
+ "loss": 0.593,
507
+ "step": 355
508
+ },
509
+ {
510
+ "epoch": 1.1467304625199362,
511
+ "grad_norm": 1.334564447402954,
512
+ "learning_rate": 2.746435677658146e-05,
513
+ "loss": 0.5689,
514
+ "step": 360
515
+ },
516
+ {
517
+ "epoch": 1.1626794258373205,
518
+ "grad_norm": 1.5465506315231323,
519
+ "learning_rate": 2.7375750070777114e-05,
520
+ "loss": 0.5908,
521
+ "step": 365
522
+ },
523
+ {
524
+ "epoch": 1.178628389154705,
525
+ "grad_norm": 1.5634928941726685,
526
+ "learning_rate": 2.7285769791229394e-05,
527
+ "loss": 0.5839,
528
+ "step": 370
529
+ },
530
+ {
531
+ "epoch": 1.1945773524720893,
532
+ "grad_norm": 1.276563048362732,
533
+ "learning_rate": 2.7194425924771317e-05,
534
+ "loss": 0.5979,
535
+ "step": 375
536
+ },
537
+ {
538
+ "epoch": 1.2105263157894737,
539
+ "grad_norm": 1.5907806158065796,
540
+ "learning_rate": 2.7101728609579216e-05,
541
+ "loss": 0.567,
542
+ "step": 380
543
+ },
544
+ {
545
+ "epoch": 1.226475279106858,
546
+ "grad_norm": 1.1192567348480225,
547
+ "learning_rate": 2.700768813404754e-05,
548
+ "loss": 0.5264,
549
+ "step": 385
550
+ },
551
+ {
552
+ "epoch": 1.2424242424242424,
553
+ "grad_norm": 1.3511794805526733,
554
+ "learning_rate": 2.691231493564693e-05,
555
+ "loss": 0.5866,
556
+ "step": 390
557
+ },
558
+ {
559
+ "epoch": 1.2583732057416268,
560
+ "grad_norm": 1.1103533506393433,
561
+ "learning_rate": 2.6815619599765775e-05,
562
+ "loss": 0.5649,
563
+ "step": 395
564
+ },
565
+ {
566
+ "epoch": 1.2743221690590112,
567
+ "grad_norm": 1.3668354749679565,
568
+ "learning_rate": 2.6717612858535356e-05,
569
+ "loss": 0.5675,
570
+ "step": 400
571
+ },
572
+ {
573
+ "epoch": 1.2902711323763956,
574
+ "grad_norm": 1.1439465284347534,
575
+ "learning_rate": 2.6618305589638695e-05,
576
+ "loss": 0.5416,
577
+ "step": 405
578
+ },
579
+ {
580
+ "epoch": 1.30622009569378,
581
+ "grad_norm": 1.1497900485992432,
582
+ "learning_rate": 2.651770881510325e-05,
583
+ "loss": 0.5459,
584
+ "step": 410
585
+ },
586
+ {
587
+ "epoch": 1.3221690590111643,
588
+ "grad_norm": 1.195949673652649,
589
+ "learning_rate": 2.641583370007759e-05,
590
+ "loss": 0.5699,
591
+ "step": 415
592
+ },
593
+ {
594
+ "epoch": 1.3381180223285487,
595
+ "grad_norm": 1.229684829711914,
596
+ "learning_rate": 2.6312691551592177e-05,
597
+ "loss": 0.5608,
598
+ "step": 420
599
+ },
600
+ {
601
+ "epoch": 1.354066985645933,
602
+ "grad_norm": 1.300110101699829,
603
+ "learning_rate": 2.620829381730443e-05,
604
+ "loss": 0.5245,
605
+ "step": 425
606
+ },
607
+ {
608
+ "epoch": 1.3700159489633175,
609
+ "grad_norm": 1.047258734703064,
610
+ "learning_rate": 2.6102652084228125e-05,
611
+ "loss": 0.5543,
612
+ "step": 430
613
+ },
614
+ {
615
+ "epoch": 1.3859649122807016,
616
+ "grad_norm": 1.314669132232666,
617
+ "learning_rate": 2.5995778077447393e-05,
618
+ "loss": 0.5681,
619
+ "step": 435
620
+ },
621
+ {
622
+ "epoch": 1.401913875598086,
623
+ "grad_norm": 1.230797290802002,
624
+ "learning_rate": 2.5887683658815358e-05,
625
+ "loss": 0.5202,
626
+ "step": 440
627
+ },
628
+ {
629
+ "epoch": 1.4178628389154704,
630
+ "grad_norm": 1.1360483169555664,
631
+ "learning_rate": 2.5778380825637592e-05,
632
+ "loss": 0.5447,
633
+ "step": 445
634
+ },
635
+ {
636
+ "epoch": 1.4338118022328548,
637
+ "grad_norm": 1.2857158184051514,
638
+ "learning_rate": 2.5667881709340532e-05,
639
+ "loss": 0.5267,
640
+ "step": 450
641
+ },
642
+ {
643
+ "epoch": 1.4497607655502391,
644
+ "grad_norm": 1.1709908246994019,
645
+ "learning_rate": 2.5556198574125053e-05,
646
+ "loss": 0.4973,
647
+ "step": 455
648
+ },
649
+ {
650
+ "epoch": 1.4657097288676235,
651
+ "grad_norm": 1.2642844915390015,
652
+ "learning_rate": 2.5443343815605262e-05,
653
+ "loss": 0.5027,
654
+ "step": 460
655
+ },
656
+ {
657
+ "epoch": 1.481658692185008,
658
+ "grad_norm": 1.2073395252227783,
659
+ "learning_rate": 2.532932995943272e-05,
660
+ "loss": 0.4725,
661
+ "step": 465
662
+ },
663
+ {
664
+ "epoch": 1.4976076555023923,
665
+ "grad_norm": 1.2468843460083008,
666
+ "learning_rate": 2.5214169659906207e-05,
667
+ "loss": 0.497,
668
+ "step": 470
669
+ },
670
+ {
671
+ "epoch": 1.5135566188197767,
672
+ "grad_norm": 1.1961091756820679,
673
+ "learning_rate": 2.509787569856728e-05,
674
+ "loss": 0.5276,
675
+ "step": 475
676
+ },
677
+ {
678
+ "epoch": 1.529505582137161,
679
+ "grad_norm": 1.1429219245910645,
680
+ "learning_rate": 2.4980460982781625e-05,
681
+ "loss": 0.492,
682
+ "step": 480
683
+ },
684
+ {
685
+ "epoch": 1.5454545454545454,
686
+ "grad_norm": 1.090941309928894,
687
+ "learning_rate": 2.486193854430649e-05,
688
+ "loss": 0.5089,
689
+ "step": 485
690
+ },
691
+ {
692
+ "epoch": 1.5614035087719298,
693
+ "grad_norm": 1.1492680311203003,
694
+ "learning_rate": 2.4742321537844305e-05,
695
+ "loss": 0.4591,
696
+ "step": 490
697
+ },
698
+ {
699
+ "epoch": 1.5773524720893142,
700
+ "grad_norm": 1.1921738386154175,
701
+ "learning_rate": 2.4621623239582637e-05,
702
+ "loss": 0.4677,
703
+ "step": 495
704
+ },
705
+ {
706
+ "epoch": 1.5933014354066986,
707
+ "grad_norm": 1.3825076818466187,
708
+ "learning_rate": 2.4499857045720705e-05,
709
+ "loss": 0.4611,
710
+ "step": 500
711
+ },
712
+ {
713
+ "epoch": 1.609250398724083,
714
+ "grad_norm": 1.1410709619522095,
715
+ "learning_rate": 2.437703647098253e-05,
716
+ "loss": 0.485,
717
+ "step": 505
718
+ },
719
+ {
720
+ "epoch": 1.6251993620414673,
721
+ "grad_norm": 1.227121353149414,
722
+ "learning_rate": 2.4253175147116943e-05,
723
+ "loss": 0.4807,
724
+ "step": 510
725
+ },
726
+ {
727
+ "epoch": 1.6411483253588517,
728
+ "grad_norm": 1.276689052581787,
729
+ "learning_rate": 2.4128286821384616e-05,
730
+ "loss": 0.448,
731
+ "step": 515
732
+ },
733
+ {
734
+ "epoch": 1.657097288676236,
735
+ "grad_norm": 1.0904544591903687,
736
+ "learning_rate": 2.400238535503228e-05,
737
+ "loss": 0.4679,
738
+ "step": 520
739
+ },
740
+ {
741
+ "epoch": 1.6730462519936204,
742
+ "grad_norm": 1.2670605182647705,
743
+ "learning_rate": 2.3875484721754245e-05,
744
+ "loss": 0.4675,
745
+ "step": 525
746
+ },
747
+ {
748
+ "epoch": 1.6889952153110048,
749
+ "grad_norm": 1.2647444009780884,
750
+ "learning_rate": 2.3747599006141497e-05,
751
+ "loss": 0.4567,
752
+ "step": 530
753
+ },
754
+ {
755
+ "epoch": 1.7049441786283892,
756
+ "grad_norm": 1.166435718536377,
757
+ "learning_rate": 2.3618742402118452e-05,
758
+ "loss": 0.4699,
759
+ "step": 535
760
+ },
761
+ {
762
+ "epoch": 1.7208931419457736,
763
+ "grad_norm": 1.3012378215789795,
764
+ "learning_rate": 2.3488929211367596e-05,
765
+ "loss": 0.45,
766
+ "step": 540
767
+ },
768
+ {
769
+ "epoch": 1.736842105263158,
770
+ "grad_norm": 1.3723088502883911,
771
+ "learning_rate": 2.3358173841742128e-05,
772
+ "loss": 0.3948,
773
+ "step": 545
774
+ },
775
+ {
776
+ "epoch": 1.7527910685805423,
777
+ "grad_norm": 1.2429518699645996,
778
+ "learning_rate": 2.3226490805666875e-05,
779
+ "loss": 0.3917,
780
+ "step": 550
781
+ },
782
+ {
783
+ "epoch": 1.7687400318979267,
784
+ "grad_norm": 1.4225772619247437,
785
+ "learning_rate": 2.3093894718527552e-05,
786
+ "loss": 0.438,
787
+ "step": 555
788
+ },
789
+ {
790
+ "epoch": 1.784688995215311,
791
+ "grad_norm": 1.1522200107574463,
792
+ "learning_rate": 2.2960400297048618e-05,
793
+ "loss": 0.405,
794
+ "step": 560
795
+ },
796
+ {
797
+ "epoch": 1.8006379585326955,
798
+ "grad_norm": 1.2812358140945435,
799
+ "learning_rate": 2.282602235765988e-05,
800
+ "loss": 0.4437,
801
+ "step": 565
802
+ },
803
+ {
804
+ "epoch": 1.8165869218500799,
805
+ "grad_norm": 1.1189430952072144,
806
+ "learning_rate": 2.2690775814852032e-05,
807
+ "loss": 0.4517,
808
+ "step": 570
809
+ },
810
+ {
811
+ "epoch": 1.8325358851674642,
812
+ "grad_norm": 1.174291729927063,
813
+ "learning_rate": 2.25546756795213e-05,
814
+ "loss": 0.3906,
815
+ "step": 575
816
+ },
817
+ {
818
+ "epoch": 1.8484848484848486,
819
+ "grad_norm": 1.1886178255081177,
820
+ "learning_rate": 2.241773705730341e-05,
821
+ "loss": 0.4126,
822
+ "step": 580
823
+ },
824
+ {
825
+ "epoch": 1.864433811802233,
826
+ "grad_norm": 1.2971218824386597,
827
+ "learning_rate": 2.2279975146897016e-05,
828
+ "loss": 0.44,
829
+ "step": 585
830
+ },
831
+ {
832
+ "epoch": 1.8803827751196174,
833
+ "grad_norm": 1.4141205549240112,
834
+ "learning_rate": 2.21414052383768e-05,
835
+ "loss": 0.3978,
836
+ "step": 590
837
+ },
838
+ {
839
+ "epoch": 1.8963317384370018,
840
+ "grad_norm": 1.3251820802688599,
841
+ "learning_rate": 2.2002042711496483e-05,
842
+ "loss": 0.3748,
843
+ "step": 595
844
+ },
845
+ {
846
+ "epoch": 1.912280701754386,
847
+ "grad_norm": 1.2246263027191162,
848
+ "learning_rate": 2.1861903033981772e-05,
849
+ "loss": 0.404,
850
+ "step": 600
851
+ },
852
+ {
853
+ "epoch": 1.9282296650717703,
854
+ "grad_norm": 1.3702830076217651,
855
+ "learning_rate": 2.1721001759813677e-05,
856
+ "loss": 0.3759,
857
+ "step": 605
858
+ },
859
+ {
860
+ "epoch": 1.9441786283891547,
861
+ "grad_norm": 1.238807201385498,
862
+ "learning_rate": 2.157935452750214e-05,
863
+ "loss": 0.3727,
864
+ "step": 610
865
+ },
866
+ {
867
+ "epoch": 1.960127591706539,
868
+ "grad_norm": 1.403578281402588,
869
+ "learning_rate": 2.1436977058350364e-05,
870
+ "loss": 0.3812,
871
+ "step": 615
872
+ },
873
+ {
874
+ "epoch": 1.9760765550239234,
875
+ "grad_norm": 1.1821719408035278,
876
+ "learning_rate": 2.1293885154709885e-05,
877
+ "loss": 0.4449,
878
+ "step": 620
879
+ },
880
+ {
881
+ "epoch": 1.9920255183413078,
882
+ "grad_norm": 1.2228764295578003,
883
+ "learning_rate": 2.115009469822672e-05,
884
+ "loss": 0.3727,
885
+ "step": 625
886
+ },
887
+ {
888
+ "epoch": 2.006379585326954,
889
+ "grad_norm": 1.0864733457565308,
890
+ "learning_rate": 2.100562164807865e-05,
891
+ "loss": 0.349,
892
+ "step": 630
893
+ },
894
+ {
895
+ "epoch": 2.0223285486443383,
896
+ "grad_norm": 1.207122802734375,
897
+ "learning_rate": 2.0860482039203933e-05,
898
+ "loss": 0.3144,
899
+ "step": 635
900
+ },
901
+ {
902
+ "epoch": 2.0382775119617227,
903
+ "grad_norm": 1.2051016092300415,
904
+ "learning_rate": 2.071469198052161e-05,
905
+ "loss": 0.2479,
906
+ "step": 640
907
+ },
908
+ {
909
+ "epoch": 2.054226475279107,
910
+ "grad_norm": 1.2147462368011475,
911
+ "learning_rate": 2.0568267653143566e-05,
912
+ "loss": 0.3296,
913
+ "step": 645
914
+ },
915
+ {
916
+ "epoch": 2.0701754385964914,
917
+ "grad_norm": 1.198175311088562,
918
+ "learning_rate": 2.0421225308578628e-05,
919
+ "loss": 0.2963,
920
+ "step": 650
921
+ },
922
+ {
923
+ "epoch": 2.0861244019138754,
924
+ "grad_norm": 1.1988497972488403,
925
+ "learning_rate": 2.027358126692881e-05,
926
+ "loss": 0.2827,
927
+ "step": 655
928
+ },
929
+ {
930
+ "epoch": 2.1020733652312598,
931
+ "grad_norm": 1.2769469022750854,
932
+ "learning_rate": 2.0125351915077965e-05,
933
+ "loss": 0.2925,
934
+ "step": 660
935
+ },
936
+ {
937
+ "epoch": 2.118022328548644,
938
+ "grad_norm": 1.300974726676941,
939
+ "learning_rate": 1.9976553704873008e-05,
940
+ "loss": 0.3019,
941
+ "step": 665
942
+ },
943
+ {
944
+ "epoch": 2.1339712918660285,
945
+ "grad_norm": 1.2205158472061157,
946
+ "learning_rate": 1.982720315129796e-05,
947
+ "loss": 0.3351,
948
+ "step": 670
949
+ },
950
+ {
951
+ "epoch": 2.149920255183413,
952
+ "grad_norm": 1.288529872894287,
953
+ "learning_rate": 1.9677316830640948e-05,
954
+ "loss": 0.2885,
955
+ "step": 675
956
+ },
957
+ {
958
+ "epoch": 2.1658692185007973,
959
+ "grad_norm": 1.2325316667556763,
960
+ "learning_rate": 1.952691137865441e-05,
961
+ "loss": 0.301,
962
+ "step": 680
963
+ },
964
+ {
965
+ "epoch": 2.1818181818181817,
966
+ "grad_norm": 1.3443962335586548,
967
+ "learning_rate": 1.9376003488708748e-05,
968
+ "loss": 0.2753,
969
+ "step": 685
970
+ },
971
+ {
972
+ "epoch": 2.197767145135566,
973
+ "grad_norm": 1.3362351655960083,
974
+ "learning_rate": 1.9224609909939486e-05,
975
+ "loss": 0.2823,
976
+ "step": 690
977
+ },
978
+ {
979
+ "epoch": 2.2137161084529504,
980
+ "grad_norm": 1.251630187034607,
981
+ "learning_rate": 1.907274744538834e-05,
982
+ "loss": 0.3052,
983
+ "step": 695
984
+ },
985
+ {
986
+ "epoch": 2.229665071770335,
987
+ "grad_norm": 1.3313004970550537,
988
+ "learning_rate": 1.8920432950138257e-05,
989
+ "loss": 0.2703,
990
+ "step": 700
991
+ },
992
+ {
993
+ "epoch": 2.245614035087719,
994
+ "grad_norm": 1.3427534103393555,
995
+ "learning_rate": 1.876768332944267e-05,
996
+ "loss": 0.2677,
997
+ "step": 705
998
+ },
999
+ {
1000
+ "epoch": 2.2615629984051036,
1001
+ "grad_norm": 1.3929996490478516,
1002
+ "learning_rate": 1.8614515536849215e-05,
1003
+ "loss": 0.2697,
1004
+ "step": 710
1005
+ },
1006
+ {
1007
+ "epoch": 2.277511961722488,
1008
+ "grad_norm": 1.151013731956482,
1009
+ "learning_rate": 1.8460946572318055e-05,
1010
+ "loss": 0.2988,
1011
+ "step": 715
1012
+ },
1013
+ {
1014
+ "epoch": 2.2934609250398723,
1015
+ "grad_norm": 1.2603211402893066,
1016
+ "learning_rate": 1.8306993480335078e-05,
1017
+ "loss": 0.2714,
1018
+ "step": 720
1019
+ },
1020
+ {
1021
+ "epoch": 2.3094098883572567,
1022
+ "grad_norm": 1.6505305767059326,
1023
+ "learning_rate": 1.8152673348020155e-05,
1024
+ "loss": 0.2818,
1025
+ "step": 725
1026
+ },
1027
+ {
1028
+ "epoch": 2.325358851674641,
1029
+ "grad_norm": 1.1775126457214355,
1030
+ "learning_rate": 1.7998003303230634e-05,
1031
+ "loss": 0.2713,
1032
+ "step": 730
1033
+ },
1034
+ {
1035
+ "epoch": 2.3413078149920254,
1036
+ "grad_norm": 1.44676673412323,
1037
+ "learning_rate": 1.7843000512660344e-05,
1038
+ "loss": 0.2759,
1039
+ "step": 735
1040
+ },
1041
+ {
1042
+ "epoch": 2.35725677830941,
1043
+ "grad_norm": 1.4021360874176025,
1044
+ "learning_rate": 1.7687682179934285e-05,
1045
+ "loss": 0.2787,
1046
+ "step": 740
1047
+ },
1048
+ {
1049
+ "epoch": 2.373205741626794,
1050
+ "grad_norm": 1.2734301090240479,
1051
+ "learning_rate": 1.7532065543699202e-05,
1052
+ "loss": 0.2672,
1053
+ "step": 745
1054
+ },
1055
+ {
1056
+ "epoch": 2.3891547049441786,
1057
+ "grad_norm": 1.1749529838562012,
1058
+ "learning_rate": 1.7376167875710296e-05,
1059
+ "loss": 0.3058,
1060
+ "step": 750
1061
+ },
1062
+ {
1063
+ "epoch": 2.405103668261563,
1064
+ "grad_norm": 1.1213788986206055,
1065
+ "learning_rate": 1.7220006478914218e-05,
1066
+ "loss": 0.2389,
1067
+ "step": 755
1068
+ },
1069
+ {
1070
+ "epoch": 2.4210526315789473,
1071
+ "grad_norm": 1.0899239778518677,
1072
+ "learning_rate": 1.7063598685528675e-05,
1073
+ "loss": 0.2255,
1074
+ "step": 760
1075
+ },
1076
+ {
1077
+ "epoch": 2.4370015948963317,
1078
+ "grad_norm": 1.2822694778442383,
1079
+ "learning_rate": 1.6906961855118703e-05,
1080
+ "loss": 0.2697,
1081
+ "step": 765
1082
+ },
1083
+ {
1084
+ "epoch": 2.452950558213716,
1085
+ "grad_norm": 1.1576502323150635,
1086
+ "learning_rate": 1.675011337266996e-05,
1087
+ "loss": 0.2625,
1088
+ "step": 770
1089
+ },
1090
+ {
1091
+ "epoch": 2.4688995215311005,
1092
+ "grad_norm": 1.1249899864196777,
1093
+ "learning_rate": 1.6593070646659175e-05,
1094
+ "loss": 0.2568,
1095
+ "step": 775
1096
+ },
1097
+ {
1098
+ "epoch": 2.484848484848485,
1099
+ "grad_norm": 1.1156028509140015,
1100
+ "learning_rate": 1.6435851107122013e-05,
1101
+ "loss": 0.2426,
1102
+ "step": 780
1103
+ },
1104
+ {
1105
+ "epoch": 2.5007974481658692,
1106
+ "grad_norm": 1.2244150638580322,
1107
+ "learning_rate": 1.6278472203718512e-05,
1108
+ "loss": 0.2313,
1109
+ "step": 785
1110
+ },
1111
+ {
1112
+ "epoch": 2.5167464114832536,
1113
+ "grad_norm": 1.187371850013733,
1114
+ "learning_rate": 1.6120951403796367e-05,
1115
+ "loss": 0.2443,
1116
+ "step": 790
1117
+ },
1118
+ {
1119
+ "epoch": 2.532695374800638,
1120
+ "grad_norm": 1.413248896598816,
1121
+ "learning_rate": 1.5963306190452238e-05,
1122
+ "loss": 0.2811,
1123
+ "step": 795
1124
+ },
1125
+ {
1126
+ "epoch": 2.5486443381180224,
1127
+ "grad_norm": 1.2305434942245483,
1128
+ "learning_rate": 1.5805554060591337e-05,
1129
+ "loss": 0.2532,
1130
+ "step": 800
1131
+ },
1132
+ {
1133
+ "epoch": 2.5645933014354068,
1134
+ "grad_norm": 1.2348670959472656,
1135
+ "learning_rate": 1.5647712522985442e-05,
1136
+ "loss": 0.2367,
1137
+ "step": 805
1138
+ },
1139
+ {
1140
+ "epoch": 2.580542264752791,
1141
+ "grad_norm": 1.3350554704666138,
1142
+ "learning_rate": 1.5489799096329607e-05,
1143
+ "loss": 0.2482,
1144
+ "step": 810
1145
+ },
1146
+ {
1147
+ "epoch": 2.5964912280701755,
1148
+ "grad_norm": 1.2583109140396118,
1149
+ "learning_rate": 1.5331831307297803e-05,
1150
+ "loss": 0.2505,
1151
+ "step": 815
1152
+ },
1153
+ {
1154
+ "epoch": 2.61244019138756,
1155
+ "grad_norm": 1.2715870141983032,
1156
+ "learning_rate": 1.5173826688597631e-05,
1157
+ "loss": 0.2417,
1158
+ "step": 820
1159
+ },
1160
+ {
1161
+ "epoch": 2.6283891547049443,
1162
+ "grad_norm": 1.3642032146453857,
1163
+ "learning_rate": 1.5015802777024382e-05,
1164
+ "loss": 0.2211,
1165
+ "step": 825
1166
+ },
1167
+ {
1168
+ "epoch": 2.6443381180223287,
1169
+ "grad_norm": 1.260809063911438,
1170
+ "learning_rate": 1.4857777111514646e-05,
1171
+ "loss": 0.2222,
1172
+ "step": 830
1173
+ },
1174
+ {
1175
+ "epoch": 2.660287081339713,
1176
+ "grad_norm": 1.310431718826294,
1177
+ "learning_rate": 1.4699767231199683e-05,
1178
+ "loss": 0.2419,
1179
+ "step": 835
1180
+ },
1181
+ {
1182
+ "epoch": 2.6762360446570974,
1183
+ "grad_norm": 1.2949233055114746,
1184
+ "learning_rate": 1.4541790673458762e-05,
1185
+ "loss": 0.2281,
1186
+ "step": 840
1187
+ },
1188
+ {
1189
+ "epoch": 2.692185007974482,
1190
+ "grad_norm": 1.227954626083374,
1191
+ "learning_rate": 1.4383864971972724e-05,
1192
+ "loss": 0.225,
1193
+ "step": 845
1194
+ },
1195
+ {
1196
+ "epoch": 2.708133971291866,
1197
+ "grad_norm": 1.3981674909591675,
1198
+ "learning_rate": 1.4226007654777903e-05,
1199
+ "loss": 0.23,
1200
+ "step": 850
1201
+ },
1202
+ {
1203
+ "epoch": 2.7240829346092506,
1204
+ "grad_norm": 1.227035641670227,
1205
+ "learning_rate": 1.4068236242320728e-05,
1206
+ "loss": 0.2424,
1207
+ "step": 855
1208
+ },
1209
+ {
1210
+ "epoch": 2.740031897926635,
1211
+ "grad_norm": 1.1334996223449707,
1212
+ "learning_rate": 1.3910568245513128e-05,
1213
+ "loss": 0.2299,
1214
+ "step": 860
1215
+ },
1216
+ {
1217
+ "epoch": 2.7559808612440193,
1218
+ "grad_norm": 1.2656800746917725,
1219
+ "learning_rate": 1.3753021163789027e-05,
1220
+ "loss": 0.2309,
1221
+ "step": 865
1222
+ },
1223
+ {
1224
+ "epoch": 2.7719298245614032,
1225
+ "grad_norm": 1.0391762256622314,
1226
+ "learning_rate": 1.3595612483162086e-05,
1227
+ "loss": 0.2143,
1228
+ "step": 870
1229
+ },
1230
+ {
1231
+ "epoch": 2.787878787878788,
1232
+ "grad_norm": 1.2347395420074463,
1233
+ "learning_rate": 1.3438359674284941e-05,
1234
+ "loss": 0.2235,
1235
+ "step": 875
1236
+ },
1237
+ {
1238
+ "epoch": 2.803827751196172,
1239
+ "grad_norm": 1.329914927482605,
1240
+ "learning_rate": 1.328128019051018e-05,
1241
+ "loss": 0.2336,
1242
+ "step": 880
1243
+ },
1244
+ {
1245
+ "epoch": 2.819776714513557,
1246
+ "grad_norm": 1.2936028242111206,
1247
+ "learning_rate": 1.3124391465953164e-05,
1248
+ "loss": 0.1938,
1249
+ "step": 885
1250
+ },
1251
+ {
1252
+ "epoch": 2.8357256778309408,
1253
+ "grad_norm": 1.2581311464309692,
1254
+ "learning_rate": 1.2967710913557067e-05,
1255
+ "loss": 0.2341,
1256
+ "step": 890
1257
+ },
1258
+ {
1259
+ "epoch": 2.8516746411483256,
1260
+ "grad_norm": 1.3273301124572754,
1261
+ "learning_rate": 1.2811255923160212e-05,
1262
+ "loss": 0.2234,
1263
+ "step": 895
1264
+ },
1265
+ {
1266
+ "epoch": 2.8676236044657095,
1267
+ "grad_norm": 1.268049955368042,
1268
+ "learning_rate": 1.2655043859565995e-05,
1269
+ "loss": 0.23,
1270
+ "step": 900
1271
+ },
1272
+ {
1273
+ "epoch": 2.8835725677830943,
1274
+ "grad_norm": 1.2851536273956299,
1275
+ "learning_rate": 1.249909206061557e-05,
1276
+ "loss": 0.2308,
1277
+ "step": 905
1278
+ },
1279
+ {
1280
+ "epoch": 2.8995215311004783,
1281
+ "grad_norm": 1.1563853025436401,
1282
+ "learning_rate": 1.2343417835263556e-05,
1283
+ "loss": 0.2149,
1284
+ "step": 910
1285
+ },
1286
+ {
1287
+ "epoch": 2.915470494417863,
1288
+ "grad_norm": 1.5606164932250977,
1289
+ "learning_rate": 1.21880384616569e-05,
1290
+ "loss": 0.2083,
1291
+ "step": 915
1292
+ },
1293
+ {
1294
+ "epoch": 2.931419457735247,
1295
+ "grad_norm": 1.1263948678970337,
1296
+ "learning_rate": 1.2032971185217241e-05,
1297
+ "loss": 0.1733,
1298
+ "step": 920
1299
+ },
1300
+ {
1301
+ "epoch": 2.9473684210526314,
1302
+ "grad_norm": 1.2153230905532837,
1303
+ "learning_rate": 1.1878233216726798e-05,
1304
+ "loss": 0.2001,
1305
+ "step": 925
1306
+ },
1307
+ {
1308
+ "epoch": 2.963317384370016,
1309
+ "grad_norm": 1.3377314805984497,
1310
+ "learning_rate": 1.1723841730418198e-05,
1311
+ "loss": 0.2026,
1312
+ "step": 930
1313
+ },
1314
+ {
1315
+ "epoch": 2.9792663476874,
1316
+ "grad_norm": 1.1396352052688599,
1317
+ "learning_rate": 1.1569813862068307e-05,
1318
+ "loss": 0.1944,
1319
+ "step": 935
1320
+ },
1321
+ {
1322
+ "epoch": 2.9952153110047846,
1323
+ "grad_norm": 1.495429277420044,
1324
+ "learning_rate": 1.1416166707096353e-05,
1325
+ "loss": 0.2012,
1326
+ "step": 940
1327
+ },
1328
+ {
1329
+ "epoch": 3.0095693779904304,
1330
+ "grad_norm": 1.0835736989974976,
1331
+ "learning_rate": 1.1262917318666517e-05,
1332
+ "loss": 0.1885,
1333
+ "step": 945
1334
+ },
1335
+ {
1336
+ "epoch": 3.025518341307815,
1337
+ "grad_norm": 1.299534797668457,
1338
+ "learning_rate": 1.111008270579521e-05,
1339
+ "loss": 0.1783,
1340
+ "step": 950
1341
+ },
1342
+ {
1343
+ "epoch": 3.041467304625199,
1344
+ "grad_norm": 1.0937697887420654,
1345
+ "learning_rate": 1.0957679831463287e-05,
1346
+ "loss": 0.1489,
1347
+ "step": 955
1348
+ },
1349
+ {
1350
+ "epoch": 3.0574162679425836,
1351
+ "grad_norm": 1.0105271339416504,
1352
+ "learning_rate": 1.0805725610733292e-05,
1353
+ "loss": 0.1752,
1354
+ "step": 960
1355
+ },
1356
+ {
1357
+ "epoch": 3.073365231259968,
1358
+ "grad_norm": 1.2927125692367554,
1359
+ "learning_rate": 1.0654236908872103e-05,
1360
+ "loss": 0.1578,
1361
+ "step": 965
1362
+ },
1363
+ {
1364
+ "epoch": 3.0893141945773523,
1365
+ "grad_norm": 1.1609300374984741,
1366
+ "learning_rate": 1.050323053947907e-05,
1367
+ "loss": 0.1423,
1368
+ "step": 970
1369
+ },
1370
+ {
1371
+ "epoch": 3.1052631578947367,
1372
+ "grad_norm": 1.0901206731796265,
1373
+ "learning_rate": 1.0352723262619878e-05,
1374
+ "loss": 0.1497,
1375
+ "step": 975
1376
+ },
1377
+ {
1378
+ "epoch": 3.121212121212121,
1379
+ "grad_norm": 0.9939388036727905,
1380
+ "learning_rate": 1.0202731782966363e-05,
1381
+ "loss": 0.1674,
1382
+ "step": 980
1383
+ },
1384
+ {
1385
+ "epoch": 3.1371610845295055,
1386
+ "grad_norm": 1.1406632661819458,
1387
+ "learning_rate": 1.0053272747942472e-05,
1388
+ "loss": 0.1291,
1389
+ "step": 985
1390
+ },
1391
+ {
1392
+ "epoch": 3.15311004784689,
1393
+ "grad_norm": 1.2126046419143677,
1394
+ "learning_rate": 9.904362745876609e-06,
1395
+ "loss": 0.1455,
1396
+ "step": 990
1397
+ },
1398
+ {
1399
+ "epoch": 3.1690590111642742,
1400
+ "grad_norm": 1.2059701681137085,
1401
+ "learning_rate": 9.756018304160458e-06,
1402
+ "loss": 0.1273,
1403
+ "step": 995
1404
+ },
1405
+ {
1406
+ "epoch": 3.1850079744816586,
1407
+ "grad_norm": 1.1024702787399292,
1408
+ "learning_rate": 9.608255887414673e-06,
1409
+ "loss": 0.1509,
1410
+ "step": 1000
1411
+ },
1412
+ {
1413
+ "epoch": 3.200956937799043,
1414
+ "grad_norm": 1.252359390258789,
1415
+ "learning_rate": 9.46109189566145e-06,
1416
+ "loss": 0.1488,
1417
+ "step": 1005
1418
+ },
1419
+ {
1420
+ "epoch": 3.2169059011164274,
1421
+ "grad_norm": 0.9932821393013,
1422
+ "learning_rate": 9.314542662504316e-06,
1423
+ "loss": 0.158,
1424
+ "step": 1010
1425
+ },
1426
+ {
1427
+ "epoch": 3.2328548644338118,
1428
+ "grad_norm": 1.2389966249465942,
1429
+ "learning_rate": 9.168624453315284e-06,
1430
+ "loss": 0.148,
1431
+ "step": 1015
1432
+ },
1433
+ {
1434
+ "epoch": 3.248803827751196,
1435
+ "grad_norm": 1.2603164911270142,
1436
+ "learning_rate": 9.023353463429556e-06,
1437
+ "loss": 0.1614,
1438
+ "step": 1020
1439
+ },
1440
+ {
1441
+ "epoch": 3.2647527910685805,
1442
+ "grad_norm": 0.96819669008255,
1443
+ "learning_rate": 8.878745816348025e-06,
1444
+ "loss": 0.1436,
1445
+ "step": 1025
1446
+ },
1447
+ {
1448
+ "epoch": 3.280701754385965,
1449
+ "grad_norm": 0.9492572546005249,
1450
+ "learning_rate": 8.734817561947759e-06,
1451
+ "loss": 0.151,
1452
+ "step": 1030
1453
+ },
1454
+ {
1455
+ "epoch": 3.2966507177033493,
1456
+ "grad_norm": 1.0807409286499023,
1457
+ "learning_rate": 8.591584674700613e-06,
1458
+ "loss": 0.1484,
1459
+ "step": 1035
1460
+ },
1461
+ {
1462
+ "epoch": 3.3125996810207337,
1463
+ "grad_norm": 1.1965206861495972,
1464
+ "learning_rate": 8.449063051900233e-06,
1465
+ "loss": 0.1518,
1466
+ "step": 1040
1467
+ },
1468
+ {
1469
+ "epoch": 3.328548644338118,
1470
+ "grad_norm": 1.0824549198150635,
1471
+ "learning_rate": 8.307268511897667e-06,
1472
+ "loss": 0.1423,
1473
+ "step": 1045
1474
+ },
1475
+ {
1476
+ "epoch": 3.3444976076555024,
1477
+ "grad_norm": 1.178969383239746,
1478
+ "learning_rate": 8.166216792345648e-06,
1479
+ "loss": 0.1415,
1480
+ "step": 1050
1481
+ },
1482
+ {
1483
+ "epoch": 3.360446570972887,
1484
+ "grad_norm": 1.2653928995132446,
1485
+ "learning_rate": 8.02592354845194e-06,
1486
+ "loss": 0.1392,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 3.376395534290271,
1491
+ "grad_norm": 1.0409033298492432,
1492
+ "learning_rate": 7.886404351241731e-06,
1493
+ "loss": 0.1189,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 3.3923444976076556,
1498
+ "grad_norm": 1.204380989074707,
1499
+ "learning_rate": 7.747674685829451e-06,
1500
+ "loss": 0.1465,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 3.40829346092504,
1505
+ "grad_norm": 1.1518088579177856,
1506
+ "learning_rate": 7.609749949700084e-06,
1507
+ "loss": 0.1081,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 3.4242424242424243,
1512
+ "grad_norm": 1.0367203950881958,
1513
+ "learning_rate": 7.472645451000214e-06,
1514
+ "loss": 0.1154,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 3.4401913875598087,
1519
+ "grad_norm": 1.1489876508712769,
1520
+ "learning_rate": 7.3363764068389674e-06,
1521
+ "loss": 0.1339,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 3.456140350877193,
1526
+ "grad_norm": 1.1962957382202148,
1527
+ "learning_rate": 7.200957941599126e-06,
1528
+ "loss": 0.1519,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 3.4720893141945774,
1533
+ "grad_norm": 1.111609697341919,
1534
+ "learning_rate": 7.066405085258427e-06,
1535
+ "loss": 0.1386,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 3.488038277511962,
1540
+ "grad_norm": 1.0738730430603027,
1541
+ "learning_rate": 6.932732771721447e-06,
1542
+ "loss": 0.1281,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 3.503987240829346,
1547
+ "grad_norm": 1.0701189041137695,
1548
+ "learning_rate": 6.799955837162082e-06,
1549
+ "loss": 0.1189,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 3.5199362041467306,
1554
+ "grad_norm": 1.2711684703826904,
1555
+ "learning_rate": 6.668089018376892e-06,
1556
+ "loss": 0.1256,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 3.535885167464115,
1561
+ "grad_norm": 0.9734954833984375,
1562
+ "learning_rate": 6.537146951149463e-06,
1563
+ "loss": 0.1268,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 3.5518341307814993,
1568
+ "grad_norm": 1.0430442094802856,
1569
+ "learning_rate": 6.407144168626038e-06,
1570
+ "loss": 0.1308,
1571
+ "step": 1115
1572
+ },
1573
+ {
1574
+ "epoch": 3.5677830940988837,
1575
+ "grad_norm": 1.088063359260559,
1576
+ "learning_rate": 6.2780950997024345e-06,
1577
+ "loss": 0.1225,
1578
+ "step": 1120
1579
+ },
1580
+ {
1581
+ "epoch": 3.583732057416268,
1582
+ "grad_norm": 1.3580750226974487,
1583
+ "learning_rate": 6.1500140674226575e-06,
1584
+ "loss": 0.1486,
1585
+ "step": 1125
1586
+ },
1587
+ {
1588
+ "epoch": 3.5996810207336525,
1589
+ "grad_norm": 1.069808006286621,
1590
+ "learning_rate": 6.02291528738914e-06,
1591
+ "loss": 0.1387,
1592
+ "step": 1130
1593
+ },
1594
+ {
1595
+ "epoch": 3.6156299840510364,
1596
+ "grad_norm": 1.0957990884780884,
1597
+ "learning_rate": 5.896812866185011e-06,
1598
+ "loss": 0.133,
1599
+ "step": 1135
1600
+ },
1601
+ {
1602
+ "epoch": 3.6315789473684212,
1603
+ "grad_norm": 1.1064181327819824,
1604
+ "learning_rate": 5.7717207998083895e-06,
1605
+ "loss": 0.138,
1606
+ "step": 1140
1607
+ },
1608
+ {
1609
+ "epoch": 3.647527910685805,
1610
+ "grad_norm": 1.1944586038589478,
1611
+ "learning_rate": 5.647652972118998e-06,
1612
+ "loss": 0.1339,
1613
+ "step": 1145
1614
+ },
1615
+ {
1616
+ "epoch": 3.66347687400319,
1617
+ "grad_norm": 1.0332287549972534,
1618
+ "learning_rate": 5.524623153297183e-06,
1619
+ "loss": 0.1287,
1620
+ "step": 1150
1621
+ },
1622
+ {
1623
+ "epoch": 3.679425837320574,
1624
+ "grad_norm": 1.0928303003311157,
1625
+ "learning_rate": 5.402644998315609e-06,
1626
+ "loss": 0.1262,
1627
+ "step": 1155
1628
+ },
1629
+ {
1630
+ "epoch": 3.6953748006379588,
1631
+ "grad_norm": 1.2543113231658936,
1632
+ "learning_rate": 5.281732045423664e-06,
1633
+ "loss": 0.1443,
1634
+ "step": 1160
1635
+ },
1636
+ {
1637
+ "epoch": 3.7113237639553427,
1638
+ "grad_norm": 0.9908517003059387,
1639
+ "learning_rate": 5.1618977146449e-06,
1640
+ "loss": 0.1091,
1641
+ "step": 1165
1642
+ },
1643
+ {
1644
+ "epoch": 3.7272727272727275,
1645
+ "grad_norm": 1.0878533124923706,
1646
+ "learning_rate": 5.04315530628752e-06,
1647
+ "loss": 0.1064,
1648
+ "step": 1170
1649
+ },
1650
+ {
1651
+ "epoch": 3.7432216905901115,
1652
+ "grad_norm": 1.1831837892532349,
1653
+ "learning_rate": 4.925517999468232e-06,
1654
+ "loss": 0.1234,
1655
+ "step": 1175
1656
+ },
1657
+ {
1658
+ "epoch": 3.7591706539074963,
1659
+ "grad_norm": 0.9875324368476868,
1660
+ "learning_rate": 4.808998850649456e-06,
1661
+ "loss": 0.1181,
1662
+ "step": 1180
1663
+ },
1664
+ {
1665
+ "epoch": 3.77511961722488,
1666
+ "grad_norm": 1.011039137840271,
1667
+ "learning_rate": 4.693610792190252e-06,
1668
+ "loss": 0.1121,
1669
+ "step": 1185
1670
+ },
1671
+ {
1672
+ "epoch": 3.7910685805422646,
1673
+ "grad_norm": 1.1317858695983887,
1674
+ "learning_rate": 4.579366630910923e-06,
1675
+ "loss": 0.116,
1676
+ "step": 1190
1677
+ },
1678
+ {
1679
+ "epoch": 3.807017543859649,
1680
+ "grad_norm": 0.919337272644043,
1681
+ "learning_rate": 4.466279046671637e-06,
1682
+ "loss": 0.0911,
1683
+ "step": 1195
1684
+ },
1685
+ {
1686
+ "epoch": 3.8229665071770333,
1687
+ "grad_norm": 1.0885950326919556,
1688
+ "learning_rate": 4.3543605909650676e-06,
1689
+ "loss": 0.1339,
1690
+ "step": 1200
1691
+ },
1692
+ {
1693
+ "epoch": 3.8389154704944177,
1694
+ "grad_norm": 0.9658224582672119,
1695
+ "learning_rate": 4.243623685523341e-06,
1696
+ "loss": 0.1289,
1697
+ "step": 1205
1698
+ },
1699
+ {
1700
+ "epoch": 3.854864433811802,
1701
+ "grad_norm": 0.9771375060081482,
1702
+ "learning_rate": 4.134080620939325e-06,
1703
+ "loss": 0.1214,
1704
+ "step": 1210
1705
+ },
1706
+ {
1707
+ "epoch": 3.8708133971291865,
1708
+ "grad_norm": 0.993411123752594,
1709
+ "learning_rate": 4.025743555302564e-06,
1710
+ "loss": 0.1117,
1711
+ "step": 1215
1712
+ },
1713
+ {
1714
+ "epoch": 3.886762360446571,
1715
+ "grad_norm": 1.0799087285995483,
1716
+ "learning_rate": 3.918624512849791e-06,
1717
+ "loss": 0.1062,
1718
+ "step": 1220
1719
+ },
1720
+ {
1721
+ "epoch": 3.9027113237639552,
1722
+ "grad_norm": 1.17735755443573,
1723
+ "learning_rate": 3.8127353826304303e-06,
1724
+ "loss": 0.1354,
1725
+ "step": 1225
1726
+ },
1727
+ {
1728
+ "epoch": 3.9186602870813396,
1729
+ "grad_norm": 0.9857722520828247,
1730
+ "learning_rate": 3.7080879171869967e-06,
1731
+ "loss": 0.1194,
1732
+ "step": 1230
1733
+ },
1734
+ {
1735
+ "epoch": 3.934609250398724,
1736
+ "grad_norm": 1.1036955118179321,
1737
+ "learning_rate": 3.6046937312507296e-06,
1738
+ "loss": 0.1201,
1739
+ "step": 1235
1740
+ },
1741
+ {
1742
+ "epoch": 3.9505582137161084,
1743
+ "grad_norm": 1.0318442583084106,
1744
+ "learning_rate": 3.5025643004524467e-06,
1745
+ "loss": 0.1114,
1746
+ "step": 1240
1747
+ },
1748
+ {
1749
+ "epoch": 3.9665071770334928,
1750
+ "grad_norm": 0.9945523738861084,
1751
+ "learning_rate": 3.4017109600489068e-06,
1752
+ "loss": 0.0964,
1753
+ "step": 1245
1754
+ },
1755
+ {
1756
+ "epoch": 3.982456140350877,
1757
+ "grad_norm": 1.079817533493042,
1758
+ "learning_rate": 3.302144903664698e-06,
1759
+ "loss": 0.1193,
1760
+ "step": 1250
1761
+ },
1762
+ {
1763
+ "epoch": 3.9984051036682615,
1764
+ "grad_norm": 1.1149638891220093,
1765
+ "learning_rate": 3.2038771820498834e-06,
1766
+ "loss": 0.1272,
1767
+ "step": 1255
1768
+ }
1769
+ ],
1770
+ "logging_steps": 5,
1771
+ "max_steps": 1570,
1772
+ "num_input_tokens_seen": 0,
1773
+ "num_train_epochs": 5,
1774
+ "save_steps": 2000,
1775
+ "stateful_callbacks": {
1776
+ "TrainerControl": {
1777
+ "args": {
1778
+ "should_epoch_stop": false,
1779
+ "should_evaluate": false,
1780
+ "should_log": false,
1781
+ "should_save": true,
1782
+ "should_training_stop": false
1783
+ },
1784
+ "attributes": {}
1785
+ }
1786
+ },
1787
+ "total_flos": 1.7098990387755745e+18,
1788
+ "train_batch_size": 2,
1789
+ "trial_name": null,
1790
+ "trial_params": null
1791
+ }
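
The trainer_state.json above is plain JSON whose `log_history` list holds one record per logging step (epoch, grad_norm, learning_rate, loss, step). A minimal sketch of reading it back for inspection, assuming the checkpoint path used in this repository:

```python
# Minimal sketch: summarize the training log stored in trainer_state.json.
# The path is an assumption based on this repository's layout.
import json

with open("13_128_e5_3e-5/checkpoint-1256/trainer_state.json") as f:
    state = json.load(f)

entries = [e for e in state["log_history"] if "loss" in e]
steps = [e["step"] for e in entries]
losses = [e["loss"] for e in entries]
print(f"logged {len(steps)} points, last step {steps[-1]}, last loss {losses[-1]:.4f}")
```
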
13_128_e5_3e-5/checkpoint-1256/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:91232031982acf747326dd0572af7262d829f9122bf0cae34fb47e91305f6c59
3
+ size 7736
13_128_e5_3e-5/checkpoint-1256/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1256/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from a zero 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
25
+ # DeepSpeed data structures it has to be available in the current python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict()
35
+ param_shapes: dict()
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict()
39
+ frozen_param_fragments: dict()
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
62
+ if not os.path.isdir(checkpoint_dir):
63
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
64
+
65
+ # there should be only one file
66
+ if zero_stage <= 2:
67
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
68
+ elif zero_stage == 3:
69
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
70
+
71
+ if not os.path.exists(file):
72
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
73
+
74
+ return file
75
+
76
+
77
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
78
+ # XXX: need to test that this simple glob rule works for multi-node setup too
79
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
80
+
81
+ if len(ckpt_files) == 0:
82
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
83
+
84
+ return ckpt_files
85
+
86
+
87
+ def get_optim_files(checkpoint_dir):
88
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
89
+
90
+
91
+ def get_model_state_files(checkpoint_dir):
92
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
93
+
94
+
95
+ def parse_model_states(files):
96
+ zero_model_states = []
97
+ for file in files:
98
+ state_dict = torch.load(file, map_location=device)
99
+
100
+ if BUFFER_NAMES not in state_dict:
101
+ raise ValueError(f"{file} is not a model state checkpoint")
102
+ buffer_names = state_dict[BUFFER_NAMES]
103
+ if debug:
104
+ print("Found buffers:", buffer_names)
105
+
106
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
107
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
108
+ param_shapes = state_dict[PARAM_SHAPES]
109
+
110
+ # collect parameters that are included in param_shapes
111
+ param_names = []
112
+ for s in param_shapes:
113
+ for name in s.keys():
114
+ param_names.append(name)
115
+
116
+ # update with frozen parameters
117
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
118
+ if frozen_param_shapes is not None:
119
+ if debug:
120
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
121
+ param_names += list(frozen_param_shapes.keys())
122
+
123
+ # handle shared params
124
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
125
+
126
+ ds_version = state_dict.get(DS_VERSION, None)
127
+
128
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
129
+
130
+ z_model_state = zero_model_state(buffers=buffers,
131
+ param_shapes=param_shapes,
132
+ shared_params=shared_params,
133
+ ds_version=ds_version,
134
+ frozen_param_shapes=frozen_param_shapes,
135
+ frozen_param_fragments=frozen_param_fragments)
136
+ zero_model_states.append(z_model_state)
137
+
138
+ return zero_model_states
139
+
140
+
141
+ def parse_optim_states(files, ds_checkpoint_dir):
142
+
143
+ total_files = len(files)
144
+ state_dicts = []
145
+ for f in files:
146
+ state_dict = torch.load(f, map_location=device)
147
+ # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
148
+ # and also handle the case where it was already removed by another helper script
149
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
150
+ state_dicts.append(state_dict)
151
+
152
+ if not ZERO_STAGE in state_dicts[0][OPTIMIZER_STATE_DICT]:
153
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
154
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
155
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
156
+
157
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
158
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
159
+ # use the max of the partition_count to get the dp world_size.
160
+
161
+ if type(world_size) is list:
162
+ world_size = max(world_size)
163
+
164
+ if world_size != total_files:
165
+ raise ValueError(
166
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
167
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
168
+ )
169
+
170
+ # the groups are named differently in each stage
171
+ if zero_stage <= 2:
172
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
173
+ elif zero_stage == 3:
174
+ fp32_groups_key = FP32_FLAT_GROUPS
175
+ else:
176
+ raise ValueError(f"unknown zero stage {zero_stage}")
177
+
178
+ if zero_stage <= 2:
179
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
180
+ elif zero_stage == 3:
181
+ # if there is more than one param group, there will be multiple flattened tensors - one
182
+ # flattened tensor per group - for simplicity merge them into a single tensor
183
+ #
184
+ # XXX: could make the script more memory efficient for when there are multiple groups - it
185
+ # will require matching the sub-lists of param_shapes for each param group flattened tensor
186
+
187
+ fp32_flat_groups = [
188
+ torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
189
+ ]
190
+
191
+ return zero_stage, world_size, fp32_flat_groups
192
+
193
+
194
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
195
+ """
196
+ Returns fp32 state_dict reconstructed from ds checkpoint
197
+
198
+ Args:
199
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
200
+
201
+ """
202
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
203
+
204
+ optim_files = get_optim_files(ds_checkpoint_dir)
205
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
206
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
207
+
208
+ model_files = get_model_state_files(ds_checkpoint_dir)
209
+
210
+ zero_model_states = parse_model_states(model_files)
211
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
212
+
213
+ if zero_stage <= 2:
214
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
215
+ exclude_frozen_parameters)
216
+ elif zero_stage == 3:
217
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
218
+ exclude_frozen_parameters)
219
+
220
+
221
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
222
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
223
+ return
224
+
225
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
226
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
227
+
228
+ if debug:
229
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
230
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
231
+
232
+ wanted_params = len(frozen_param_shapes)
233
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
234
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
235
+ print(f'Frozen params: Have {avail_numel} numels to process.')
236
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
237
+
238
+ total_params = 0
239
+ total_numel = 0
240
+ for name, shape in frozen_param_shapes.items():
241
+ total_params += 1
242
+ unpartitioned_numel = shape.numel()
243
+ total_numel += unpartitioned_numel
244
+
245
+ state_dict[name] = frozen_param_fragments[name]
246
+
247
+ if debug:
248
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
249
+
250
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
251
+
252
+
253
+ def _has_callable(obj, fn):
254
+ attr = getattr(obj, fn, None)
255
+ return callable(attr)
256
+
257
+
258
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
259
+ param_shapes = zero_model_states[0].param_shapes
260
+
261
+ # Reconstruction protocol:
262
+ #
263
+ # XXX: document this
264
+
265
+ if debug:
266
+ for i in range(world_size):
267
+ for j in range(len(fp32_flat_groups[0])):
268
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
269
+
270
+ # XXX: memory usage doubles here (zero2)
271
+ num_param_groups = len(fp32_flat_groups[0])
272
+ merged_single_partition_of_fp32_groups = []
273
+ for i in range(num_param_groups):
274
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
275
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
276
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
277
+ avail_numel = sum(
278
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
279
+
280
+ if debug:
281
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
282
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
283
+ # not asserting if there is a mismatch due to possible padding
284
+ print(f"Have {avail_numel} numels to process.")
285
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
286
+
287
+ # params
288
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
289
+ # out-of-core computing solution
290
+ total_numel = 0
291
+ total_params = 0
292
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
293
+ offset = 0
294
+ avail_numel = full_single_fp32_vector.numel()
295
+ for name, shape in shapes.items():
296
+
297
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
298
+ total_numel += unpartitioned_numel
299
+ total_params += 1
300
+
301
+ if debug:
302
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
303
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
304
+ offset += unpartitioned_numel
305
+
306
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
307
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
308
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
309
+ # live optimizer object, so we are checking that the numbers are within the right range
310
+ align_to = 2 * world_size
311
+
312
+ def zero2_align(x):
313
+ return align_to * math.ceil(x / align_to)
314
+
315
+ if debug:
316
+ print(f"original offset={offset}, avail_numel={avail_numel}")
317
+
318
+ offset = zero2_align(offset)
319
+ avail_numel = zero2_align(avail_numel)
320
+
321
+ if debug:
322
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
323
+
324
+ # Sanity check
325
+ if offset != avail_numel:
326
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
327
+
328
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
329
+
330
+
331
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
332
+ exclude_frozen_parameters):
333
+ state_dict = OrderedDict()
334
+
335
+ # buffers
336
+ buffers = zero_model_states[0].buffers
337
+ state_dict.update(buffers)
338
+ if debug:
339
+ print(f"added {len(buffers)} buffers")
340
+
341
+ if not exclude_frozen_parameters:
342
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
343
+
344
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
345
+
346
+ # recover shared parameters
347
+ for pair in zero_model_states[0].shared_params:
348
+ if pair[1] in state_dict:
349
+ state_dict[pair[0]] = state_dict[pair[1]]
350
+
351
+ return state_dict
352
+
353
+
354
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
355
+ remainder = unpartitioned_numel % world_size
356
+ padding_numel = (world_size - remainder) if remainder else 0
357
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
358
+ return partitioned_numel, padding_numel
359
+
360
+
361
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
362
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
363
+ return
364
+
365
+ if debug:
366
+ for i in range(world_size):
367
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
368
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
369
+
370
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
371
+ wanted_params = len(frozen_param_shapes)
372
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
373
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
374
+ print(f'Frozen params: Have {avail_numel} numels to process.')
375
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
376
+
377
+ total_params = 0
378
+ total_numel = 0
379
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
380
+ total_params += 1
381
+ unpartitioned_numel = shape.numel()
382
+ total_numel += unpartitioned_numel
383
+
384
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
385
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
386
+
387
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
388
+
389
+ if debug:
390
+ print(
391
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
392
+ )
393
+
394
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
395
+
396
+
397
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
398
+ param_shapes = zero_model_states[0].param_shapes
399
+ avail_numel = fp32_flat_groups[0].numel() * world_size
400
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
401
+ # param, re-consolidating each param, while dealing with padding if any
402
+
403
+ # merge list of dicts, preserving order
404
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
405
+
406
+ if debug:
407
+ for i in range(world_size):
408
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
409
+
410
+ wanted_params = len(param_shapes)
411
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
412
+ # not asserting if there is a mismatch due to possible padding
413
+ avail_numel = fp32_flat_groups[0].numel() * world_size
414
+ print(f"Trainable params: Have {avail_numel} numels to process.")
415
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
416
+
417
+ # params
418
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
419
+ # out-of-core computing solution
420
+ offset = 0
421
+ total_numel = 0
422
+ total_params = 0
423
+ for name, shape in param_shapes.items():
424
+
425
+ unpartitioned_numel = shape.numel()
426
+ total_numel += unpartitioned_numel
427
+ total_params += 1
428
+
429
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
430
+
431
+ if debug:
432
+ print(
433
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
434
+ )
435
+
436
+ # XXX: memory usage doubles here
437
+ state_dict[name] = torch.cat(
438
+ tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
439
+ 0).narrow(0, 0, unpartitioned_numel).view(shape)
440
+ offset += partitioned_numel
441
+
442
+ offset *= world_size
443
+
444
+ # Sanity check
445
+ if offset != avail_numel:
446
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
447
+
448
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
449
+
450
+
451
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
452
+ exclude_frozen_parameters):
453
+ state_dict = OrderedDict()
454
+
455
+ # buffers
456
+ buffers = zero_model_states[0].buffers
457
+ state_dict.update(buffers)
458
+ if debug:
459
+ print(f"added {len(buffers)} buffers")
460
+
461
+ if not exclude_frozen_parameters:
462
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
463
+
464
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
465
+
466
+ # recover shared parameters
467
+ for pair in zero_model_states[0].shared_params:
468
+ if pair[1] in state_dict:
469
+ state_dict[pair[0]] = state_dict[pair[1]]
470
+
471
+ return state_dict
472
+
473
+
474
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
475
+ """
476
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
477
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
478
+ via a model hub.
479
+
480
+ Args:
481
+ - ``checkpoint_dir``: path to the desired checkpoint folder
482
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
483
+ - ``exclude_frozen_parameters``: exclude frozen parameters
484
+
485
+ Returns:
486
+ - pytorch ``state_dict``
487
+
488
+ Note: this approach may not work if your application doesn't have sufficient free CPU memory and
489
+ you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
490
+ the checkpoint.
491
+
492
+ A typical usage might be ::
493
+
494
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
495
+ # do the training and checkpoint saving
496
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
497
+ model = model.cpu() # move to cpu
498
+ model.load_state_dict(state_dict)
499
+ # submit to model hub or save the model to share with others
500
+
501
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
502
+ application. i.e. you will need to re-initialize the deepspeed engine, since
503
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
504
+
505
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
506
+
507
+ """
508
+ if tag is None:
509
+ latest_path = os.path.join(checkpoint_dir, 'latest')
510
+ if os.path.isfile(latest_path):
511
+ with open(latest_path, 'r') as fd:
512
+ tag = fd.read().strip()
513
+ else:
514
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
515
+
516
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
517
+
518
+ if not os.path.isdir(ds_checkpoint_dir):
519
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
520
+
521
+ return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
522
+
523
+
524
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
525
+ """
526
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
527
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
528
+
529
+ Args:
530
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
531
+ - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
532
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
533
+ - ``exclude_frozen_parameters``: exclude frozen parameters
534
+ """
535
+
536
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
537
+ print(f"Saving fp32 state dict to {output_file}")
538
+ torch.save(state_dict, output_file)
539
+
540
+
541
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
542
+ """
543
+ 1. Put the provided model to cpu
544
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
545
+ 3. Load it into the provided model
546
+
547
+ Args:
548
+ - ``model``: the model object to update
549
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
550
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
551
+
552
+ Returns:
553
+ - ``model``: modified model
554
+
555
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
556
+ have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
557
+ conveniently placed for you in the checkpoint folder.
558
+
559
+ A typical usage might be ::
560
+
561
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
562
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
563
+ # submit to model hub or save the model to share with others
564
+
565
+ Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
566
+ of the same application. i.e. you will need to re-initialize the deepspeed engine, since
567
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
568
+
569
+ """
570
+ logger.info(f"Extracting fp32 weights")
571
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
572
+
573
+ logger.info(f"Overwriting model with fp32 weights")
574
+ model = model.cpu()
575
+ model.load_state_dict(state_dict, strict=False)
576
+
577
+ return model
578
+
579
+
580
+ if __name__ == "__main__":
581
+
582
+ parser = argparse.ArgumentParser()
583
+ parser.add_argument("checkpoint_dir",
584
+ type=str,
585
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
586
+ parser.add_argument(
587
+ "output_file",
588
+ type=str,
589
+ help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
590
+ parser.add_argument("-t",
591
+ "--tag",
592
+ type=str,
593
+ default=None,
594
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
595
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
596
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
597
+ args = parser.parse_args()
598
+
599
+ debug = args.debug
600
+
601
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
602
+ args.output_file,
603
+ tag=args.tag,
604
+ exclude_frozen_parameters=args.exclude_frozen_parameters)
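
The bundled `zero_to_fp32.py` documents its CLI in a comment (`python zero_to_fp32.py . pytorch_model.bin`); the same conversion can be driven from Python with the functions defined above. A minimal sketch, assuming it is run from the repository root and that DeepSpeed is installed, as the script itself requires:

```python
# Minimal sketch: consolidate the ZeRO-partitioned shards of a checkpoint into a
# single fp32 state_dict file, using the helper shipped with each checkpoint.
# The checkpoint path is an assumption based on this repository's layout.
import sys

sys.path.insert(0, "13_128_e5_3e-5/checkpoint-1256")  # the script is copied into each checkpoint dir
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "13_128_e5_3e-5/checkpoint-1256",                         # folder containing the 'latest' tag file
    "13_128_e5_3e-5/checkpoint-1256/pytorch_model_fp32.bin",  # output fp32 state_dict
)
```
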
13_128_e5_3e-5/checkpoint-1570/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: ibm-granite/granite-3.3-8b-base
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.15.2
13_128_e5_3e-5/checkpoint-1570/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 256,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 128,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "up_proj",
28
+ "down_proj",
29
+ "q_proj",
30
+ "v_proj",
31
+ "o_proj",
32
+ "gate_proj",
33
+ "k_proj"
34
+ ],
35
+ "task_type": "CAUSAL_LM",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
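
This adapter_config.json describes a LoRA adapter (r=128, lora_alpha=256, dropout 0.05) over all attention and MLP projections of `ibm-granite/granite-3.3-8b-base`. A minimal sketch of attaching it with PEFT, assuming the checkpoint folder is available locally:

```python
# Minimal sketch: load the base model and attach the LoRA adapter stored in this
# checkpoint. The local path is an assumption based on this repository's layout.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-base")
model = PeftModel.from_pretrained(base, "13_128_e5_3e-5/checkpoint-1570")
tokenizer = AutoTokenizer.from_pretrained("13_128_e5_3e-5/checkpoint-1570")
# model.merge_and_unload() would fold the adapter weights into the base model.
```
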
13_128_e5_3e-5/checkpoint-1570/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80a50cb86d635275fa121fa936d141f65435c382c76b81ba969edf9bde89801e
3
+ size 791751704
13_128_e5_3e-5/checkpoint-1570/latest ADDED
@@ -0,0 +1 @@
1
+ global_step1570
13_128_e5_3e-5/checkpoint-1570/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1570/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c6b568cb7fa80c988a4056bab065129c835a25c47c27ace628591e5c59af3da
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bfd4abf1aaec31bc2e690cda691b6dee3b1d99dc46b450e5ba1b7df115e393a4
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32bf07c12076b2d766153f468502f11f53b8a9e461dde472c42ed645f7cb591d
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e107c5283337761bc4c1d5cf82efb1be3ce79277f558e3adad7563a15f5a5fa
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8dbff265c60c51d4840005714694ca4d4cb66c697427e39a24b6f95afc8eca01
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a0370f210f3e3e670a503a6b2598298ca4c912f2209276475a66649600300e2e
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2f5e5b7ac07813f17b4cbb647207c8fbffd91c1f091e2106fc1cdcb2d526babd
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0082c35690b7675847ca3608064ff7775d9b087219a7e34452de7c8c23f4cc9
3
+ size 15920
13_128_e5_3e-5/checkpoint-1570/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2b896c44fa9a9f47be891047169b156eb8d44046e8177aef9082218ae3571341
3
+ size 1064
13_128_e5_3e-5/checkpoint-1570/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<fim_prefix>",
5
+ "<fim_middle>",
6
+ "<fim_suffix>",
7
+ "<fim_pad>",
8
+ "<filename>",
9
+ "<gh_stars>",
10
+ "<issue_start>",
11
+ "<issue_comment>",
12
+ "<issue_closed>",
13
+ "<jupyter_start>",
14
+ "<jupyter_text>",
15
+ "<jupyter_code>",
16
+ "<jupyter_output>",
17
+ "<empty_output>",
18
+ "<commit_before>",
19
+ "<commit_msg>",
20
+ "<commit_after>",
21
+ "<reponame>"
22
+ ],
23
+ "bos_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "eos_token": {
31
+ "content": "<|endoftext|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "pad_token": "<reponame>",
38
+ "unk_token": {
39
+ "content": "<|endoftext|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ }
45
+ }
13_128_e5_3e-5/checkpoint-1570/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1570/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<fim_prefix>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<fim_middle>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<fim_suffix>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<fim_pad>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<filename>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<gh_stars>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<issue_start>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_comment>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_closed>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<jupyter_start>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_text>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_code>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_output>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<empty_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<commit_before>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<commit_msg>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "17": {
141
+ "content": "<commit_after>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "18": {
149
+ "content": "<reponame>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ }
156
+ },
157
+ "additional_special_tokens": [
158
+ "<|endoftext|>",
159
+ "<fim_prefix>",
160
+ "<fim_middle>",
161
+ "<fim_suffix>",
162
+ "<fim_pad>",
163
+ "<filename>",
164
+ "<gh_stars>",
165
+ "<issue_start>",
166
+ "<issue_comment>",
167
+ "<issue_closed>",
168
+ "<jupyter_start>",
169
+ "<jupyter_text>",
170
+ "<jupyter_code>",
171
+ "<jupyter_output>",
172
+ "<empty_output>",
173
+ "<commit_before>",
174
+ "<commit_msg>",
175
+ "<commit_after>",
176
+ "<reponame>"
177
+ ],
178
+ "bos_token": "<|endoftext|>",
179
+ "clean_up_tokenization_spaces": true,
180
+ "eos_token": "<|endoftext|>",
181
+ "extra_special_tokens": {},
182
+ "model_max_length": 8192,
183
+ "pad_token": "<reponame>",
184
+ "padding_side": "left",
185
+ "tokenizer_class": "GPT2Tokenizer",
186
+ "unk_token": "<|endoftext|>",
187
+ "vocab_size": 49152
188
+ }
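
The tokenizer config above pins `pad_token` to `<reponame>` and `padding_side` to `"left"`, so batched prompts are padded on the left as expected for causal generation. A minimal sketch, with the local checkpoint path as an assumption:

```python
# Minimal sketch: left-padded batch tokenization consistent with the config above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("13_128_e5_3e-5/checkpoint-1570")
batch = tok(["def add(a, b):", "print('hi')"], padding=True, return_tensors="pt")
print(tok.pad_token, tok.padding_side, batch["input_ids"].shape)
```
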
13_128_e5_3e-5/checkpoint-1570/trainer_state.json ADDED
@@ -0,0 +1,2232 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 5.0,
6
+ "eval_steps": 500,
7
+ "global_step": 1570,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.01594896331738437,
14
+ "grad_norm": 1.6145861148834229,
15
+ "learning_rate": 1.518987341772152e-06,
16
+ "loss": 1.43,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.03189792663476874,
21
+ "grad_norm": 1.0209437608718872,
22
+ "learning_rate": 3.4177215189873417e-06,
23
+ "loss": 1.3954,
24
+ "step": 10
25
+ },
26
+ {
27
+ "epoch": 0.04784688995215311,
28
+ "grad_norm": 0.7023665308952332,
29
+ "learning_rate": 5.3164556962025316e-06,
30
+ "loss": 1.412,
31
+ "step": 15
32
+ },
33
+ {
34
+ "epoch": 0.06379585326953748,
35
+ "grad_norm": 0.6941770911216736,
36
+ "learning_rate": 7.215189873417722e-06,
37
+ "loss": 1.3573,
38
+ "step": 20
39
+ },
40
+ {
41
+ "epoch": 0.07974481658692185,
42
+ "grad_norm": 0.5891399383544922,
43
+ "learning_rate": 9.113924050632912e-06,
44
+ "loss": 1.3281,
45
+ "step": 25
46
+ },
47
+ {
48
+ "epoch": 0.09569377990430622,
49
+ "grad_norm": 0.6663492321968079,
50
+ "learning_rate": 1.1012658227848103e-05,
51
+ "loss": 1.3214,
52
+ "step": 30
53
+ },
54
+ {
55
+ "epoch": 0.11164274322169059,
56
+ "grad_norm": 0.4784772992134094,
57
+ "learning_rate": 1.2911392405063291e-05,
58
+ "loss": 1.2544,
59
+ "step": 35
60
+ },
61
+ {
62
+ "epoch": 0.12759170653907495,
63
+ "grad_norm": 0.5382697582244873,
64
+ "learning_rate": 1.4810126582278482e-05,
65
+ "loss": 1.2353,
66
+ "step": 40
67
+ },
68
+ {
69
+ "epoch": 0.14354066985645933,
70
+ "grad_norm": 0.47352251410484314,
71
+ "learning_rate": 1.670886075949367e-05,
72
+ "loss": 1.2898,
73
+ "step": 45
74
+ },
75
+ {
76
+ "epoch": 0.1594896331738437,
77
+ "grad_norm": 0.4318561851978302,
78
+ "learning_rate": 1.860759493670886e-05,
79
+ "loss": 1.2872,
80
+ "step": 50
81
+ },
82
+ {
83
+ "epoch": 0.17543859649122806,
84
+ "grad_norm": 0.4748729467391968,
85
+ "learning_rate": 2.050632911392405e-05,
86
+ "loss": 1.2466,
87
+ "step": 55
88
+ },
89
+ {
90
+ "epoch": 0.19138755980861244,
91
+ "grad_norm": 0.4512109160423279,
92
+ "learning_rate": 2.240506329113924e-05,
93
+ "loss": 1.2634,
94
+ "step": 60
95
+ },
96
+ {
97
+ "epoch": 0.20733652312599682,
98
+ "grad_norm": 0.4634909927845001,
99
+ "learning_rate": 2.430379746835443e-05,
100
+ "loss": 1.2455,
101
+ "step": 65
102
+ },
103
+ {
104
+ "epoch": 0.22328548644338117,
105
+ "grad_norm": 0.5010442733764648,
106
+ "learning_rate": 2.620253164556962e-05,
107
+ "loss": 1.192,
108
+ "step": 70
109
+ },
110
+ {
111
+ "epoch": 0.23923444976076555,
112
+ "grad_norm": 0.581072211265564,
113
+ "learning_rate": 2.8101265822784812e-05,
114
+ "loss": 1.1886,
115
+ "step": 75
116
+ },
117
+ {
118
+ "epoch": 0.2551834130781499,
119
+ "grad_norm": 0.4782950282096863,
120
+ "learning_rate": 3e-05,
121
+ "loss": 1.1987,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.2711323763955343,
126
+ "grad_norm": 0.583503782749176,
127
+ "learning_rate": 2.999916758151899e-05,
128
+ "loss": 1.1802,
129
+ "step": 85
130
+ },
131
+ {
132
+ "epoch": 0.28708133971291866,
133
+ "grad_norm": 0.48859143257141113,
134
+ "learning_rate": 2.999667041846535e-05,
135
+ "loss": 1.1947,
136
+ "step": 90
137
+ },
138
+ {
139
+ "epoch": 0.30303030303030304,
140
+ "grad_norm": 0.6850916743278503,
141
+ "learning_rate": 2.9992508787997044e-05,
142
+ "loss": 1.1564,
143
+ "step": 95
144
+ },
145
+ {
146
+ "epoch": 0.3189792663476874,
147
+ "grad_norm": 0.5491988658905029,
148
+ "learning_rate": 2.9986683152009822e-05,
149
+ "loss": 1.1004,
150
+ "step": 100
151
+ },
152
+ {
153
+ "epoch": 0.3349282296650718,
154
+ "grad_norm": 0.7356356382369995,
155
+ "learning_rate": 2.997919415708596e-05,
156
+ "loss": 1.0898,
157
+ "step": 105
158
+ },
159
+ {
160
+ "epoch": 0.3508771929824561,
161
+ "grad_norm": 0.5359194874763489,
162
+ "learning_rate": 2.9970042634422484e-05,
163
+ "loss": 1.1407,
164
+ "step": 110
165
+ },
166
+ {
167
+ "epoch": 0.3668261562998405,
168
+ "grad_norm": 0.7027154564857483,
169
+ "learning_rate": 2.995922959973895e-05,
170
+ "loss": 1.0929,
171
+ "step": 115
172
+ },
173
+ {
174
+ "epoch": 0.3827751196172249,
175
+ "grad_norm": 0.6330419182777405,
176
+ "learning_rate": 2.994675625316468e-05,
177
+ "loss": 1.0657,
178
+ "step": 120
179
+ },
180
+ {
181
+ "epoch": 0.39872408293460926,
182
+ "grad_norm": 0.6170833110809326,
183
+ "learning_rate": 2.9932623979105558e-05,
184
+ "loss": 1.0718,
185
+ "step": 125
186
+ },
187
+ {
188
+ "epoch": 0.41467304625199364,
189
+ "grad_norm": 0.6576064229011536,
190
+ "learning_rate": 2.9916834346090406e-05,
191
+ "loss": 1.0926,
192
+ "step": 130
193
+ },
194
+ {
195
+ "epoch": 0.430622009569378,
196
+ "grad_norm": 0.5628703832626343,
197
+ "learning_rate": 2.9899389106596867e-05,
198
+ "loss": 0.9873,
199
+ "step": 135
200
+ },
201
+ {
202
+ "epoch": 0.44657097288676234,
203
+ "grad_norm": 0.6521088480949402,
204
+ "learning_rate": 2.9880290196856913e-05,
205
+ "loss": 1.0647,
206
+ "step": 140
207
+ },
208
+ {
209
+ "epoch": 0.4625199362041467,
210
+ "grad_norm": 0.7059735059738159,
211
+ "learning_rate": 2.9859539736641926e-05,
212
+ "loss": 1.0348,
213
+ "step": 145
214
+ },
215
+ {
216
+ "epoch": 0.4784688995215311,
217
+ "grad_norm": 0.6659881472587585,
218
+ "learning_rate": 2.983714002902746e-05,
219
+ "loss": 1.035,
220
+ "step": 150
221
+ },
222
+ {
223
+ "epoch": 0.4944178628389155,
224
+ "grad_norm": 0.6957710981369019,
225
+ "learning_rate": 2.9813093560137577e-05,
226
+ "loss": 1.0298,
227
+ "step": 155
228
+ },
229
+ {
230
+ "epoch": 0.5103668261562998,
231
+ "grad_norm": 0.7887131571769714,
232
+ "learning_rate": 2.978740299886898e-05,
233
+ "loss": 0.9958,
234
+ "step": 160
235
+ },
236
+ {
237
+ "epoch": 0.5263157894736842,
238
+ "grad_norm": 0.8920831084251404,
239
+ "learning_rate": 2.9760071196594715e-05,
240
+ "loss": 1.0141,
241
+ "step": 165
242
+ },
243
+ {
244
+ "epoch": 0.5422647527910686,
245
+ "grad_norm": 0.9099592566490173,
246
+ "learning_rate": 2.973110118684777e-05,
247
+ "loss": 0.9644,
248
+ "step": 170
249
+ },
250
+ {
251
+ "epoch": 0.5582137161084529,
252
+ "grad_norm": 0.9702492356300354,
253
+ "learning_rate": 2.970049618498434e-05,
254
+ "loss": 0.9158,
255
+ "step": 175
256
+ },
257
+ {
258
+ "epoch": 0.5741626794258373,
259
+ "grad_norm": 0.8628178834915161,
260
+ "learning_rate": 2.9668259587826984e-05,
261
+ "loss": 0.934,
262
+ "step": 180
263
+ },
264
+ {
265
+ "epoch": 0.5901116427432217,
266
+ "grad_norm": 0.7894458174705505,
267
+ "learning_rate": 2.9634394973287605e-05,
268
+ "loss": 0.9344,
269
+ "step": 185
270
+ },
271
+ {
272
+ "epoch": 0.6060606060606061,
273
+ "grad_norm": 0.7742944955825806,
274
+ "learning_rate": 2.9598906099970324e-05,
275
+ "loss": 0.9333,
276
+ "step": 190
277
+ },
278
+ {
279
+ "epoch": 0.6220095693779905,
280
+ "grad_norm": 0.8715435266494751,
281
+ "learning_rate": 2.956179690675435e-05,
282
+ "loss": 0.8983,
283
+ "step": 195
284
+ },
285
+ {
286
+ "epoch": 0.6379585326953748,
287
+ "grad_norm": 0.9087679386138916,
288
+ "learning_rate": 2.9523071512356785e-05,
289
+ "loss": 0.8283,
290
+ "step": 200
291
+ },
292
+ {
293
+ "epoch": 0.6539074960127592,
294
+ "grad_norm": 0.8821738958358765,
295
+ "learning_rate": 2.9482734214875492e-05,
296
+ "loss": 0.9741,
297
+ "step": 205
298
+ },
299
+ {
300
+ "epoch": 0.6698564593301436,
301
+ "grad_norm": 0.8856318593025208,
302
+ "learning_rate": 2.9440789491312053e-05,
303
+ "loss": 0.8902,
304
+ "step": 210
305
+ },
306
+ {
307
+ "epoch": 0.6858054226475279,
308
+ "grad_norm": 0.9013231992721558,
309
+ "learning_rate": 2.9397241997074885e-05,
310
+ "loss": 0.9095,
311
+ "step": 215
312
+ },
313
+ {
314
+ "epoch": 0.7017543859649122,
315
+ "grad_norm": 0.9037012457847595,
316
+ "learning_rate": 2.9352096565462518e-05,
317
+ "loss": 0.8663,
318
+ "step": 220
319
+ },
320
+ {
321
+ "epoch": 0.7177033492822966,
322
+ "grad_norm": 0.9097235798835754,
323
+ "learning_rate": 2.9305358207127163e-05,
324
+ "loss": 0.8667,
325
+ "step": 225
326
+ },
327
+ {
328
+ "epoch": 0.733652312599681,
329
+ "grad_norm": 0.8799058198928833,
330
+ "learning_rate": 2.9257032109518594e-05,
331
+ "loss": 0.8142,
332
+ "step": 230
333
+ },
334
+ {
335
+ "epoch": 0.7496012759170654,
336
+ "grad_norm": 1.0260847806930542,
337
+ "learning_rate": 2.9207123636308372e-05,
338
+ "loss": 0.8293,
339
+ "step": 235
340
+ },
341
+ {
342
+ "epoch": 0.7655502392344498,
343
+ "grad_norm": 0.9114710688591003,
344
+ "learning_rate": 2.9155638326794564e-05,
345
+ "loss": 0.8146,
346
+ "step": 240
347
+ },
348
+ {
349
+ "epoch": 0.7814992025518341,
350
+ "grad_norm": 0.9736585021018982,
351
+ "learning_rate": 2.9102581895286923e-05,
352
+ "loss": 0.8152,
353
+ "step": 245
354
+ },
355
+ {
356
+ "epoch": 0.7974481658692185,
357
+ "grad_norm": 1.001495361328125,
358
+ "learning_rate": 2.9047960230472655e-05,
359
+ "loss": 0.869,
360
+ "step": 250
361
+ },
362
+ {
363
+ "epoch": 0.8133971291866029,
364
+ "grad_norm": 1.0725624561309814,
365
+ "learning_rate": 2.8991779394762875e-05,
366
+ "loss": 0.8488,
367
+ "step": 255
368
+ },
369
+ {
370
+ "epoch": 0.8293460925039873,
371
+ "grad_norm": 1.0831763744354248,
372
+ "learning_rate": 2.8934045623619697e-05,
373
+ "loss": 0.809,
374
+ "step": 260
375
+ },
376
+ {
377
+ "epoch": 0.8452950558213717,
378
+ "grad_norm": 1.060348629951477,
379
+ "learning_rate": 2.88747653248642e-05,
380
+ "loss": 0.8019,
381
+ "step": 265
382
+ },
383
+ {
384
+ "epoch": 0.861244019138756,
385
+ "grad_norm": 1.001414179801941,
386
+ "learning_rate": 2.8813945077965217e-05,
387
+ "loss": 0.7483,
388
+ "step": 270
389
+ },
390
+ {
391
+ "epoch": 0.8771929824561403,
392
+ "grad_norm": 1.0342857837677002,
393
+ "learning_rate": 2.875159163330909e-05,
394
+ "loss": 0.7767,
395
+ "step": 275
396
+ },
397
+ {
398
+ "epoch": 0.8931419457735247,
399
+ "grad_norm": 1.079073190689087,
400
+ "learning_rate": 2.8687711911450436e-05,
401
+ "loss": 0.7291,
402
+ "step": 280
403
+ },
404
+ {
405
+ "epoch": 0.9090909090909091,
406
+ "grad_norm": 1.0348238945007324,
407
+ "learning_rate": 2.862231300234407e-05,
408
+ "loss": 0.7566,
409
+ "step": 285
410
+ },
411
+ {
412
+ "epoch": 0.9250398724082934,
413
+ "grad_norm": 0.9865140318870544,
414
+ "learning_rate": 2.8555402164558058e-05,
415
+ "loss": 0.7365,
416
+ "step": 290
417
+ },
418
+ {
419
+ "epoch": 0.9409888357256778,
420
+ "grad_norm": 0.9957141280174255,
421
+ "learning_rate": 2.8486986824468134e-05,
422
+ "loss": 0.7394,
423
+ "step": 295
424
+ },
425
+ {
426
+ "epoch": 0.9569377990430622,
427
+ "grad_norm": 1.2500948905944824,
428
+ "learning_rate": 2.841707457543343e-05,
429
+ "loss": 0.7223,
430
+ "step": 300
431
+ },
432
+ {
433
+ "epoch": 0.9728867623604466,
434
+ "grad_norm": 1.0007729530334473,
435
+ "learning_rate": 2.8345673176953692e-05,
436
+ "loss": 0.7563,
437
+ "step": 305
438
+ },
439
+ {
440
+ "epoch": 0.988835725677831,
441
+ "grad_norm": 1.1462596654891968,
442
+ "learning_rate": 2.8272790553808082e-05,
443
+ "loss": 0.7193,
444
+ "step": 310
445
+ },
446
+ {
447
+ "epoch": 1.003189792663477,
448
+ "grad_norm": 1.1253020763397217,
449
+ "learning_rate": 2.8198434795175585e-05,
450
+ "loss": 0.6737,
451
+ "step": 315
452
+ },
453
+ {
454
+ "epoch": 1.0191387559808613,
455
+ "grad_norm": 1.070001482963562,
456
+ "learning_rate": 2.8122614153737228e-05,
457
+ "loss": 0.6186,
458
+ "step": 320
459
+ },
460
+ {
461
+ "epoch": 1.0350877192982457,
462
+ "grad_norm": 1.606988549232483,
463
+ "learning_rate": 2.8045337044760103e-05,
464
+ "loss": 0.6471,
465
+ "step": 325
466
+ },
467
+ {
468
+ "epoch": 1.0510366826156299,
469
+ "grad_norm": 1.4467155933380127,
470
+ "learning_rate": 2.7966612045163363e-05,
471
+ "loss": 0.6399,
472
+ "step": 330
473
+ },
474
+ {
475
+ "epoch": 1.0669856459330143,
476
+ "grad_norm": 1.2260429859161377,
477
+ "learning_rate": 2.7886447892566284e-05,
478
+ "loss": 0.6327,
479
+ "step": 335
480
+ },
481
+ {
482
+ "epoch": 1.0829346092503986,
483
+ "grad_norm": 0.9376658797264099,
484
+ "learning_rate": 2.7804853484318488e-05,
485
+ "loss": 0.6429,
486
+ "step": 340
487
+ },
488
+ {
489
+ "epoch": 1.098883572567783,
490
+ "grad_norm": 1.349844217300415,
491
+ "learning_rate": 2.7721837876512425e-05,
492
+ "loss": 0.6507,
493
+ "step": 345
494
+ },
495
+ {
496
+ "epoch": 1.1148325358851674,
497
+ "grad_norm": 1.0897091627120972,
498
+ "learning_rate": 2.763741028297824e-05,
499
+ "loss": 0.6087,
500
+ "step": 350
501
+ },
502
+ {
503
+ "epoch": 1.1307814992025518,
504
+ "grad_norm": 1.189454436302185,
505
+ "learning_rate": 2.755158007426116e-05,
506
+ "loss": 0.593,
507
+ "step": 355
508
+ },
509
+ {
510
+ "epoch": 1.1467304625199362,
511
+ "grad_norm": 1.334564447402954,
512
+ "learning_rate": 2.746435677658146e-05,
513
+ "loss": 0.5689,
514
+ "step": 360
515
+ },
516
+ {
517
+ "epoch": 1.1626794258373205,
518
+ "grad_norm": 1.5465506315231323,
519
+ "learning_rate": 2.7375750070777114e-05,
520
+ "loss": 0.5908,
521
+ "step": 365
522
+ },
523
+ {
524
+ "epoch": 1.178628389154705,
525
+ "grad_norm": 1.5634928941726685,
526
+ "learning_rate": 2.7285769791229394e-05,
527
+ "loss": 0.5839,
528
+ "step": 370
529
+ },
530
+ {
531
+ "epoch": 1.1945773524720893,
532
+ "grad_norm": 1.276563048362732,
533
+ "learning_rate": 2.7194425924771317e-05,
534
+ "loss": 0.5979,
535
+ "step": 375
536
+ },
537
+ {
538
+ "epoch": 1.2105263157894737,
539
+ "grad_norm": 1.5907806158065796,
540
+ "learning_rate": 2.7101728609579216e-05,
541
+ "loss": 0.567,
542
+ "step": 380
543
+ },
544
+ {
545
+ "epoch": 1.226475279106858,
546
+ "grad_norm": 1.1192567348480225,
547
+ "learning_rate": 2.700768813404754e-05,
548
+ "loss": 0.5264,
549
+ "step": 385
550
+ },
551
+ {
552
+ "epoch": 1.2424242424242424,
553
+ "grad_norm": 1.3511794805526733,
554
+ "learning_rate": 2.691231493564693e-05,
555
+ "loss": 0.5866,
556
+ "step": 390
557
+ },
558
+ {
559
+ "epoch": 1.2583732057416268,
560
+ "grad_norm": 1.1103533506393433,
561
+ "learning_rate": 2.6815619599765775e-05,
562
+ "loss": 0.5649,
563
+ "step": 395
564
+ },
565
+ {
566
+ "epoch": 1.2743221690590112,
567
+ "grad_norm": 1.3668354749679565,
568
+ "learning_rate": 2.6717612858535356e-05,
569
+ "loss": 0.5675,
570
+ "step": 400
571
+ },
572
+ {
573
+ "epoch": 1.2902711323763956,
574
+ "grad_norm": 1.1439465284347534,
575
+ "learning_rate": 2.6618305589638695e-05,
576
+ "loss": 0.5416,
577
+ "step": 405
578
+ },
579
+ {
580
+ "epoch": 1.30622009569378,
581
+ "grad_norm": 1.1497900485992432,
582
+ "learning_rate": 2.651770881510325e-05,
583
+ "loss": 0.5459,
584
+ "step": 410
585
+ },
586
+ {
587
+ "epoch": 1.3221690590111643,
588
+ "grad_norm": 1.195949673652649,
589
+ "learning_rate": 2.641583370007759e-05,
590
+ "loss": 0.5699,
591
+ "step": 415
592
+ },
593
+ {
594
+ "epoch": 1.3381180223285487,
595
+ "grad_norm": 1.229684829711914,
596
+ "learning_rate": 2.6312691551592177e-05,
597
+ "loss": 0.5608,
598
+ "step": 420
599
+ },
600
+ {
601
+ "epoch": 1.354066985645933,
602
+ "grad_norm": 1.300110101699829,
603
+ "learning_rate": 2.620829381730443e-05,
604
+ "loss": 0.5245,
605
+ "step": 425
606
+ },
607
+ {
608
+ "epoch": 1.3700159489633175,
609
+ "grad_norm": 1.047258734703064,
610
+ "learning_rate": 2.6102652084228125e-05,
611
+ "loss": 0.5543,
612
+ "step": 430
613
+ },
614
+ {
615
+ "epoch": 1.3859649122807016,
616
+ "grad_norm": 1.314669132232666,
617
+ "learning_rate": 2.5995778077447393e-05,
618
+ "loss": 0.5681,
619
+ "step": 435
620
+ },
621
+ {
622
+ "epoch": 1.401913875598086,
623
+ "grad_norm": 1.230797290802002,
624
+ "learning_rate": 2.5887683658815358e-05,
625
+ "loss": 0.5202,
626
+ "step": 440
627
+ },
628
+ {
629
+ "epoch": 1.4178628389154704,
630
+ "grad_norm": 1.1360483169555664,
631
+ "learning_rate": 2.5778380825637592e-05,
632
+ "loss": 0.5447,
633
+ "step": 445
634
+ },
635
+ {
636
+ "epoch": 1.4338118022328548,
637
+ "grad_norm": 1.2857158184051514,
638
+ "learning_rate": 2.5667881709340532e-05,
639
+ "loss": 0.5267,
640
+ "step": 450
641
+ },
642
+ {
643
+ "epoch": 1.4497607655502391,
644
+ "grad_norm": 1.1709908246994019,
645
+ "learning_rate": 2.5556198574125053e-05,
646
+ "loss": 0.4973,
647
+ "step": 455
648
+ },
649
+ {
650
+ "epoch": 1.4657097288676235,
651
+ "grad_norm": 1.2642844915390015,
652
+ "learning_rate": 2.5443343815605262e-05,
653
+ "loss": 0.5027,
654
+ "step": 460
655
+ },
656
+ {
657
+ "epoch": 1.481658692185008,
658
+ "grad_norm": 1.2073395252227783,
659
+ "learning_rate": 2.532932995943272e-05,
660
+ "loss": 0.4725,
661
+ "step": 465
662
+ },
663
+ {
664
+ "epoch": 1.4976076555023923,
665
+ "grad_norm": 1.2468843460083008,
666
+ "learning_rate": 2.5214169659906207e-05,
667
+ "loss": 0.497,
668
+ "step": 470
669
+ },
670
+ {
671
+ "epoch": 1.5135566188197767,
672
+ "grad_norm": 1.1961091756820679,
673
+ "learning_rate": 2.509787569856728e-05,
674
+ "loss": 0.5276,
675
+ "step": 475
676
+ },
677
+ {
678
+ "epoch": 1.529505582137161,
679
+ "grad_norm": 1.1429219245910645,
680
+ "learning_rate": 2.4980460982781625e-05,
681
+ "loss": 0.492,
682
+ "step": 480
683
+ },
684
+ {
685
+ "epoch": 1.5454545454545454,
686
+ "grad_norm": 1.090941309928894,
687
+ "learning_rate": 2.486193854430649e-05,
688
+ "loss": 0.5089,
689
+ "step": 485
690
+ },
691
+ {
692
+ "epoch": 1.5614035087719298,
693
+ "grad_norm": 1.1492680311203003,
694
+ "learning_rate": 2.4742321537844305e-05,
695
+ "loss": 0.4591,
696
+ "step": 490
697
+ },
698
+ {
699
+ "epoch": 1.5773524720893142,
700
+ "grad_norm": 1.1921738386154175,
701
+ "learning_rate": 2.4621623239582637e-05,
702
+ "loss": 0.4677,
703
+ "step": 495
704
+ },
705
+ {
706
+ "epoch": 1.5933014354066986,
707
+ "grad_norm": 1.3825076818466187,
708
+ "learning_rate": 2.4499857045720705e-05,
709
+ "loss": 0.4611,
710
+ "step": 500
711
+ },
712
+ {
713
+ "epoch": 1.609250398724083,
714
+ "grad_norm": 1.1410709619522095,
715
+ "learning_rate": 2.437703647098253e-05,
716
+ "loss": 0.485,
717
+ "step": 505
718
+ },
719
+ {
720
+ "epoch": 1.6251993620414673,
721
+ "grad_norm": 1.227121353149414,
722
+ "learning_rate": 2.4253175147116943e-05,
723
+ "loss": 0.4807,
724
+ "step": 510
725
+ },
726
+ {
727
+ "epoch": 1.6411483253588517,
728
+ "grad_norm": 1.276689052581787,
729
+ "learning_rate": 2.4128286821384616e-05,
730
+ "loss": 0.448,
731
+ "step": 515
732
+ },
733
+ {
734
+ "epoch": 1.657097288676236,
735
+ "grad_norm": 1.0904544591903687,
736
+ "learning_rate": 2.400238535503228e-05,
737
+ "loss": 0.4679,
738
+ "step": 520
739
+ },
740
+ {
741
+ "epoch": 1.6730462519936204,
742
+ "grad_norm": 1.2670605182647705,
743
+ "learning_rate": 2.3875484721754245e-05,
744
+ "loss": 0.4675,
745
+ "step": 525
746
+ },
747
+ {
748
+ "epoch": 1.6889952153110048,
749
+ "grad_norm": 1.2647444009780884,
750
+ "learning_rate": 2.3747599006141497e-05,
751
+ "loss": 0.4567,
752
+ "step": 530
753
+ },
754
+ {
755
+ "epoch": 1.7049441786283892,
756
+ "grad_norm": 1.166435718536377,
757
+ "learning_rate": 2.3618742402118452e-05,
758
+ "loss": 0.4699,
759
+ "step": 535
760
+ },
761
+ {
762
+ "epoch": 1.7208931419457736,
763
+ "grad_norm": 1.3012378215789795,
764
+ "learning_rate": 2.3488929211367596e-05,
765
+ "loss": 0.45,
766
+ "step": 540
767
+ },
768
+ {
769
+ "epoch": 1.736842105263158,
770
+ "grad_norm": 1.3723088502883911,
771
+ "learning_rate": 2.3358173841742128e-05,
772
+ "loss": 0.3948,
773
+ "step": 545
774
+ },
775
+ {
776
+ "epoch": 1.7527910685805423,
777
+ "grad_norm": 1.2429518699645996,
778
+ "learning_rate": 2.3226490805666875e-05,
779
+ "loss": 0.3917,
780
+ "step": 550
781
+ },
782
+ {
783
+ "epoch": 1.7687400318979267,
784
+ "grad_norm": 1.4225772619247437,
785
+ "learning_rate": 2.3093894718527552e-05,
786
+ "loss": 0.438,
787
+ "step": 555
788
+ },
789
+ {
790
+ "epoch": 1.784688995215311,
791
+ "grad_norm": 1.1522200107574463,
792
+ "learning_rate": 2.2960400297048618e-05,
793
+ "loss": 0.405,
794
+ "step": 560
795
+ },
796
+ {
797
+ "epoch": 1.8006379585326955,
798
+ "grad_norm": 1.2812358140945435,
799
+ "learning_rate": 2.282602235765988e-05,
800
+ "loss": 0.4437,
801
+ "step": 565
802
+ },
803
+ {
804
+ "epoch": 1.8165869218500799,
805
+ "grad_norm": 1.1189430952072144,
806
+ "learning_rate": 2.2690775814852032e-05,
807
+ "loss": 0.4517,
808
+ "step": 570
809
+ },
810
+ {
811
+ "epoch": 1.8325358851674642,
812
+ "grad_norm": 1.174291729927063,
813
+ "learning_rate": 2.25546756795213e-05,
814
+ "loss": 0.3906,
815
+ "step": 575
816
+ },
817
+ {
818
+ "epoch": 1.8484848484848486,
819
+ "grad_norm": 1.1886178255081177,
820
+ "learning_rate": 2.241773705730341e-05,
821
+ "loss": 0.4126,
822
+ "step": 580
823
+ },
824
+ {
825
+ "epoch": 1.864433811802233,
826
+ "grad_norm": 1.2971218824386597,
827
+ "learning_rate": 2.2279975146897016e-05,
828
+ "loss": 0.44,
829
+ "step": 585
830
+ },
831
+ {
832
+ "epoch": 1.8803827751196174,
833
+ "grad_norm": 1.4141205549240112,
834
+ "learning_rate": 2.21414052383768e-05,
835
+ "loss": 0.3978,
836
+ "step": 590
837
+ },
838
+ {
839
+ "epoch": 1.8963317384370018,
840
+ "grad_norm": 1.3251820802688599,
841
+ "learning_rate": 2.2002042711496483e-05,
842
+ "loss": 0.3748,
843
+ "step": 595
844
+ },
845
+ {
846
+ "epoch": 1.912280701754386,
847
+ "grad_norm": 1.2246263027191162,
848
+ "learning_rate": 2.1861903033981772e-05,
849
+ "loss": 0.404,
850
+ "step": 600
851
+ },
852
+ {
853
+ "epoch": 1.9282296650717703,
854
+ "grad_norm": 1.3702830076217651,
855
+ "learning_rate": 2.1721001759813677e-05,
856
+ "loss": 0.3759,
857
+ "step": 605
858
+ },
859
+ {
860
+ "epoch": 1.9441786283891547,
861
+ "grad_norm": 1.238807201385498,
862
+ "learning_rate": 2.157935452750214e-05,
863
+ "loss": 0.3727,
864
+ "step": 610
865
+ },
866
+ {
867
+ "epoch": 1.960127591706539,
868
+ "grad_norm": 1.403578281402588,
869
+ "learning_rate": 2.1436977058350364e-05,
870
+ "loss": 0.3812,
871
+ "step": 615
872
+ },
873
+ {
874
+ "epoch": 1.9760765550239234,
875
+ "grad_norm": 1.1821719408035278,
876
+ "learning_rate": 2.1293885154709885e-05,
877
+ "loss": 0.4449,
878
+ "step": 620
879
+ },
880
+ {
881
+ "epoch": 1.9920255183413078,
882
+ "grad_norm": 1.2228764295578003,
883
+ "learning_rate": 2.115009469822672e-05,
884
+ "loss": 0.3727,
885
+ "step": 625
886
+ },
887
+ {
888
+ "epoch": 2.006379585326954,
889
+ "grad_norm": 1.0864733457565308,
890
+ "learning_rate": 2.100562164807865e-05,
891
+ "loss": 0.349,
892
+ "step": 630
893
+ },
894
+ {
895
+ "epoch": 2.0223285486443383,
896
+ "grad_norm": 1.207122802734375,
897
+ "learning_rate": 2.0860482039203933e-05,
898
+ "loss": 0.3144,
899
+ "step": 635
900
+ },
901
+ {
902
+ "epoch": 2.0382775119617227,
903
+ "grad_norm": 1.2051016092300415,
904
+ "learning_rate": 2.071469198052161e-05,
905
+ "loss": 0.2479,
906
+ "step": 640
907
+ },
908
+ {
909
+ "epoch": 2.054226475279107,
910
+ "grad_norm": 1.2147462368011475,
911
+ "learning_rate": 2.0568267653143566e-05,
912
+ "loss": 0.3296,
913
+ "step": 645
914
+ },
915
+ {
916
+ "epoch": 2.0701754385964914,
917
+ "grad_norm": 1.198175311088562,
918
+ "learning_rate": 2.0421225308578628e-05,
919
+ "loss": 0.2963,
920
+ "step": 650
921
+ },
922
+ {
923
+ "epoch": 2.0861244019138754,
924
+ "grad_norm": 1.1988497972488403,
925
+ "learning_rate": 2.027358126692881e-05,
926
+ "loss": 0.2827,
927
+ "step": 655
928
+ },
929
+ {
930
+ "epoch": 2.1020733652312598,
931
+ "grad_norm": 1.2769469022750854,
932
+ "learning_rate": 2.0125351915077965e-05,
933
+ "loss": 0.2925,
934
+ "step": 660
935
+ },
936
+ {
937
+ "epoch": 2.118022328548644,
938
+ "grad_norm": 1.300974726676941,
939
+ "learning_rate": 1.9976553704873008e-05,
940
+ "loss": 0.3019,
941
+ "step": 665
942
+ },
943
+ {
944
+ "epoch": 2.1339712918660285,
945
+ "grad_norm": 1.2205158472061157,
946
+ "learning_rate": 1.982720315129796e-05,
947
+ "loss": 0.3351,
948
+ "step": 670
949
+ },
950
+ {
951
+ "epoch": 2.149920255183413,
952
+ "grad_norm": 1.288529872894287,
953
+ "learning_rate": 1.9677316830640948e-05,
954
+ "loss": 0.2885,
955
+ "step": 675
956
+ },
957
+ {
958
+ "epoch": 2.1658692185007973,
959
+ "grad_norm": 1.2325316667556763,
960
+ "learning_rate": 1.952691137865441e-05,
961
+ "loss": 0.301,
962
+ "step": 680
963
+ },
964
+ {
965
+ "epoch": 2.1818181818181817,
966
+ "grad_norm": 1.3443962335586548,
967
+ "learning_rate": 1.9376003488708748e-05,
968
+ "loss": 0.2753,
969
+ "step": 685
970
+ },
971
+ {
972
+ "epoch": 2.197767145135566,
973
+ "grad_norm": 1.3362351655960083,
974
+ "learning_rate": 1.9224609909939486e-05,
975
+ "loss": 0.2823,
976
+ "step": 690
977
+ },
978
+ {
979
+ "epoch": 2.2137161084529504,
980
+ "grad_norm": 1.251630187034607,
981
+ "learning_rate": 1.907274744538834e-05,
982
+ "loss": 0.3052,
983
+ "step": 695
984
+ },
985
+ {
986
+ "epoch": 2.229665071770335,
987
+ "grad_norm": 1.3313004970550537,
988
+ "learning_rate": 1.8920432950138257e-05,
989
+ "loss": 0.2703,
990
+ "step": 700
991
+ },
992
+ {
993
+ "epoch": 2.245614035087719,
994
+ "grad_norm": 1.3427534103393555,
995
+ "learning_rate": 1.876768332944267e-05,
996
+ "loss": 0.2677,
997
+ "step": 705
998
+ },
999
+ {
1000
+ "epoch": 2.2615629984051036,
1001
+ "grad_norm": 1.3929996490478516,
1002
+ "learning_rate": 1.8614515536849215e-05,
1003
+ "loss": 0.2697,
1004
+ "step": 710
1005
+ },
1006
+ {
1007
+ "epoch": 2.277511961722488,
1008
+ "grad_norm": 1.151013731956482,
1009
+ "learning_rate": 1.8460946572318055e-05,
1010
+ "loss": 0.2988,
1011
+ "step": 715
1012
+ },
1013
+ {
1014
+ "epoch": 2.2934609250398723,
1015
+ "grad_norm": 1.2603211402893066,
1016
+ "learning_rate": 1.8306993480335078e-05,
1017
+ "loss": 0.2714,
1018
+ "step": 720
1019
+ },
1020
+ {
1021
+ "epoch": 2.3094098883572567,
1022
+ "grad_norm": 1.6505305767059326,
1023
+ "learning_rate": 1.8152673348020155e-05,
1024
+ "loss": 0.2818,
1025
+ "step": 725
1026
+ },
1027
+ {
1028
+ "epoch": 2.325358851674641,
1029
+ "grad_norm": 1.1775126457214355,
1030
+ "learning_rate": 1.7998003303230634e-05,
1031
+ "loss": 0.2713,
1032
+ "step": 730
1033
+ },
1034
+ {
1035
+ "epoch": 2.3413078149920254,
1036
+ "grad_norm": 1.44676673412323,
1037
+ "learning_rate": 1.7843000512660344e-05,
1038
+ "loss": 0.2759,
1039
+ "step": 735
1040
+ },
1041
+ {
1042
+ "epoch": 2.35725677830941,
1043
+ "grad_norm": 1.4021360874176025,
1044
+ "learning_rate": 1.7687682179934285e-05,
1045
+ "loss": 0.2787,
1046
+ "step": 740
1047
+ },
1048
+ {
1049
+ "epoch": 2.373205741626794,
1050
+ "grad_norm": 1.2734301090240479,
1051
+ "learning_rate": 1.7532065543699202e-05,
1052
+ "loss": 0.2672,
1053
+ "step": 745
1054
+ },
1055
+ {
1056
+ "epoch": 2.3891547049441786,
1057
+ "grad_norm": 1.1749529838562012,
1058
+ "learning_rate": 1.7376167875710296e-05,
1059
+ "loss": 0.3058,
1060
+ "step": 750
1061
+ },
1062
+ {
1063
+ "epoch": 2.405103668261563,
1064
+ "grad_norm": 1.1213788986206055,
1065
+ "learning_rate": 1.7220006478914218e-05,
1066
+ "loss": 0.2389,
1067
+ "step": 755
1068
+ },
1069
+ {
1070
+ "epoch": 2.4210526315789473,
1071
+ "grad_norm": 1.0899239778518677,
1072
+ "learning_rate": 1.7063598685528675e-05,
1073
+ "loss": 0.2255,
1074
+ "step": 760
1075
+ },
1076
+ {
1077
+ "epoch": 2.4370015948963317,
1078
+ "grad_norm": 1.2822694778442383,
1079
+ "learning_rate": 1.6906961855118703e-05,
1080
+ "loss": 0.2697,
1081
+ "step": 765
1082
+ },
1083
+ {
1084
+ "epoch": 2.452950558213716,
1085
+ "grad_norm": 1.1576502323150635,
1086
+ "learning_rate": 1.675011337266996e-05,
1087
+ "loss": 0.2625,
1088
+ "step": 770
1089
+ },
1090
+ {
1091
+ "epoch": 2.4688995215311005,
1092
+ "grad_norm": 1.1249899864196777,
1093
+ "learning_rate": 1.6593070646659175e-05,
1094
+ "loss": 0.2568,
1095
+ "step": 775
1096
+ },
1097
+ {
1098
+ "epoch": 2.484848484848485,
1099
+ "grad_norm": 1.1156028509140015,
1100
+ "learning_rate": 1.6435851107122013e-05,
1101
+ "loss": 0.2426,
1102
+ "step": 780
1103
+ },
1104
+ {
1105
+ "epoch": 2.5007974481658692,
1106
+ "grad_norm": 1.2244150638580322,
1107
+ "learning_rate": 1.6278472203718512e-05,
1108
+ "loss": 0.2313,
1109
+ "step": 785
1110
+ },
1111
+ {
1112
+ "epoch": 2.5167464114832536,
1113
+ "grad_norm": 1.187371850013733,
1114
+ "learning_rate": 1.6120951403796367e-05,
1115
+ "loss": 0.2443,
1116
+ "step": 790
1117
+ },
1118
+ {
1119
+ "epoch": 2.532695374800638,
1120
+ "grad_norm": 1.413248896598816,
1121
+ "learning_rate": 1.5963306190452238e-05,
1122
+ "loss": 0.2811,
1123
+ "step": 795
1124
+ },
1125
+ {
1126
+ "epoch": 2.5486443381180224,
1127
+ "grad_norm": 1.2305434942245483,
1128
+ "learning_rate": 1.5805554060591337e-05,
1129
+ "loss": 0.2532,
1130
+ "step": 800
1131
+ },
1132
+ {
1133
+ "epoch": 2.5645933014354068,
1134
+ "grad_norm": 1.2348670959472656,
1135
+ "learning_rate": 1.5647712522985442e-05,
1136
+ "loss": 0.2367,
1137
+ "step": 805
1138
+ },
1139
+ {
1140
+ "epoch": 2.580542264752791,
1141
+ "grad_norm": 1.3350554704666138,
1142
+ "learning_rate": 1.5489799096329607e-05,
1143
+ "loss": 0.2482,
1144
+ "step": 810
1145
+ },
1146
+ {
1147
+ "epoch": 2.5964912280701755,
1148
+ "grad_norm": 1.2583109140396118,
1149
+ "learning_rate": 1.5331831307297803e-05,
1150
+ "loss": 0.2505,
1151
+ "step": 815
1152
+ },
1153
+ {
1154
+ "epoch": 2.61244019138756,
1155
+ "grad_norm": 1.2715870141983032,
1156
+ "learning_rate": 1.5173826688597631e-05,
1157
+ "loss": 0.2417,
1158
+ "step": 820
1159
+ },
1160
+ {
1161
+ "epoch": 2.6283891547049443,
1162
+ "grad_norm": 1.3642032146453857,
1163
+ "learning_rate": 1.5015802777024382e-05,
1164
+ "loss": 0.2211,
1165
+ "step": 825
1166
+ },
1167
+ {
1168
+ "epoch": 2.6443381180223287,
1169
+ "grad_norm": 1.260809063911438,
1170
+ "learning_rate": 1.4857777111514646e-05,
1171
+ "loss": 0.2222,
1172
+ "step": 830
1173
+ },
1174
+ {
1175
+ "epoch": 2.660287081339713,
1176
+ "grad_norm": 1.310431718826294,
1177
+ "learning_rate": 1.4699767231199683e-05,
1178
+ "loss": 0.2419,
1179
+ "step": 835
1180
+ },
1181
+ {
1182
+ "epoch": 2.6762360446570974,
1183
+ "grad_norm": 1.2949233055114746,
1184
+ "learning_rate": 1.4541790673458762e-05,
1185
+ "loss": 0.2281,
1186
+ "step": 840
1187
+ },
1188
+ {
1189
+ "epoch": 2.692185007974482,
1190
+ "grad_norm": 1.227954626083374,
1191
+ "learning_rate": 1.4383864971972724e-05,
1192
+ "loss": 0.225,
1193
+ "step": 845
1194
+ },
1195
+ {
1196
+ "epoch": 2.708133971291866,
1197
+ "grad_norm": 1.3981674909591675,
1198
+ "learning_rate": 1.4226007654777903e-05,
1199
+ "loss": 0.23,
1200
+ "step": 850
1201
+ },
1202
+ {
1203
+ "epoch": 2.7240829346092506,
1204
+ "grad_norm": 1.227035641670227,
1205
+ "learning_rate": 1.4068236242320728e-05,
1206
+ "loss": 0.2424,
1207
+ "step": 855
1208
+ },
1209
+ {
1210
+ "epoch": 2.740031897926635,
1211
+ "grad_norm": 1.1334996223449707,
1212
+ "learning_rate": 1.3910568245513128e-05,
1213
+ "loss": 0.2299,
1214
+ "step": 860
1215
+ },
1216
+ {
1217
+ "epoch": 2.7559808612440193,
1218
+ "grad_norm": 1.2656800746917725,
1219
+ "learning_rate": 1.3753021163789027e-05,
1220
+ "loss": 0.2309,
1221
+ "step": 865
1222
+ },
1223
+ {
1224
+ "epoch": 2.7719298245614032,
1225
+ "grad_norm": 1.0391762256622314,
1226
+ "learning_rate": 1.3595612483162086e-05,
1227
+ "loss": 0.2143,
1228
+ "step": 870
1229
+ },
1230
+ {
1231
+ "epoch": 2.787878787878788,
1232
+ "grad_norm": 1.2347395420074463,
1233
+ "learning_rate": 1.3438359674284941e-05,
1234
+ "loss": 0.2235,
1235
+ "step": 875
1236
+ },
1237
+ {
1238
+ "epoch": 2.803827751196172,
1239
+ "grad_norm": 1.329914927482605,
1240
+ "learning_rate": 1.328128019051018e-05,
1241
+ "loss": 0.2336,
1242
+ "step": 880
1243
+ },
1244
+ {
1245
+ "epoch": 2.819776714513557,
1246
+ "grad_norm": 1.2936028242111206,
1247
+ "learning_rate": 1.3124391465953164e-05,
1248
+ "loss": 0.1938,
1249
+ "step": 885
1250
+ },
1251
+ {
1252
+ "epoch": 2.8357256778309408,
1253
+ "grad_norm": 1.2581311464309692,
1254
+ "learning_rate": 1.2967710913557067e-05,
1255
+ "loss": 0.2341,
1256
+ "step": 890
1257
+ },
1258
+ {
1259
+ "epoch": 2.8516746411483256,
1260
+ "grad_norm": 1.3273301124572754,
1261
+ "learning_rate": 1.2811255923160212e-05,
1262
+ "loss": 0.2234,
1263
+ "step": 895
1264
+ },
1265
+ {
1266
+ "epoch": 2.8676236044657095,
1267
+ "grad_norm": 1.268049955368042,
1268
+ "learning_rate": 1.2655043859565995e-05,
1269
+ "loss": 0.23,
1270
+ "step": 900
1271
+ },
1272
+ {
1273
+ "epoch": 2.8835725677830943,
1274
+ "grad_norm": 1.2851536273956299,
1275
+ "learning_rate": 1.249909206061557e-05,
1276
+ "loss": 0.2308,
1277
+ "step": 905
1278
+ },
1279
+ {
1280
+ "epoch": 2.8995215311004783,
1281
+ "grad_norm": 1.1563853025436401,
1282
+ "learning_rate": 1.2343417835263556e-05,
1283
+ "loss": 0.2149,
1284
+ "step": 910
1285
+ },
1286
+ {
1287
+ "epoch": 2.915470494417863,
1288
+ "grad_norm": 1.5606164932250977,
1289
+ "learning_rate": 1.21880384616569e-05,
1290
+ "loss": 0.2083,
1291
+ "step": 915
1292
+ },
1293
+ {
1294
+ "epoch": 2.931419457735247,
1295
+ "grad_norm": 1.1263948678970337,
1296
+ "learning_rate": 1.2032971185217241e-05,
1297
+ "loss": 0.1733,
1298
+ "step": 920
1299
+ },
1300
+ {
1301
+ "epoch": 2.9473684210526314,
1302
+ "grad_norm": 1.2153230905532837,
1303
+ "learning_rate": 1.1878233216726798e-05,
1304
+ "loss": 0.2001,
1305
+ "step": 925
1306
+ },
1307
+ {
1308
+ "epoch": 2.963317384370016,
1309
+ "grad_norm": 1.3377314805984497,
1310
+ "learning_rate": 1.1723841730418198e-05,
1311
+ "loss": 0.2026,
1312
+ "step": 930
1313
+ },
1314
+ {
1315
+ "epoch": 2.9792663476874,
1316
+ "grad_norm": 1.1396352052688599,
1317
+ "learning_rate": 1.1569813862068307e-05,
1318
+ "loss": 0.1944,
1319
+ "step": 935
1320
+ },
1321
+ {
1322
+ "epoch": 2.9952153110047846,
1323
+ "grad_norm": 1.495429277420044,
1324
+ "learning_rate": 1.1416166707096353e-05,
1325
+ "loss": 0.2012,
1326
+ "step": 940
1327
+ },
1328
+ {
1329
+ "epoch": 3.0095693779904304,
1330
+ "grad_norm": 1.0835736989974976,
1331
+ "learning_rate": 1.1262917318666517e-05,
1332
+ "loss": 0.1885,
1333
+ "step": 945
1334
+ },
1335
+ {
1336
+ "epoch": 3.025518341307815,
1337
+ "grad_norm": 1.299534797668457,
1338
+ "learning_rate": 1.111008270579521e-05,
1339
+ "loss": 0.1783,
1340
+ "step": 950
1341
+ },
1342
+ {
1343
+ "epoch": 3.041467304625199,
1344
+ "grad_norm": 1.0937697887420654,
1345
+ "learning_rate": 1.0957679831463287e-05,
1346
+ "loss": 0.1489,
1347
+ "step": 955
1348
+ },
1349
+ {
1350
+ "epoch": 3.0574162679425836,
1351
+ "grad_norm": 1.0105271339416504,
1352
+ "learning_rate": 1.0805725610733292e-05,
1353
+ "loss": 0.1752,
1354
+ "step": 960
1355
+ },
1356
+ {
1357
+ "epoch": 3.073365231259968,
1358
+ "grad_norm": 1.2927125692367554,
1359
+ "learning_rate": 1.0654236908872103e-05,
1360
+ "loss": 0.1578,
1361
+ "step": 965
1362
+ },
1363
+ {
1364
+ "epoch": 3.0893141945773523,
1365
+ "grad_norm": 1.1609300374984741,
1366
+ "learning_rate": 1.050323053947907e-05,
1367
+ "loss": 0.1423,
1368
+ "step": 970
1369
+ },
1370
+ {
1371
+ "epoch": 3.1052631578947367,
1372
+ "grad_norm": 1.0901206731796265,
1373
+ "learning_rate": 1.0352723262619878e-05,
1374
+ "loss": 0.1497,
1375
+ "step": 975
1376
+ },
1377
+ {
1378
+ "epoch": 3.121212121212121,
1379
+ "grad_norm": 0.9939388036727905,
1380
+ "learning_rate": 1.0202731782966363e-05,
1381
+ "loss": 0.1674,
1382
+ "step": 980
1383
+ },
1384
+ {
1385
+ "epoch": 3.1371610845295055,
1386
+ "grad_norm": 1.1406632661819458,
1387
+ "learning_rate": 1.0053272747942472e-05,
1388
+ "loss": 0.1291,
1389
+ "step": 985
1390
+ },
1391
+ {
1392
+ "epoch": 3.15311004784689,
1393
+ "grad_norm": 1.2126046419143677,
1394
+ "learning_rate": 9.904362745876609e-06,
1395
+ "loss": 0.1455,
1396
+ "step": 990
1397
+ },
1398
+ {
1399
+ "epoch": 3.1690590111642742,
1400
+ "grad_norm": 1.2059701681137085,
1401
+ "learning_rate": 9.756018304160458e-06,
1402
+ "loss": 0.1273,
1403
+ "step": 995
1404
+ },
1405
+ {
1406
+ "epoch": 3.1850079744816586,
1407
+ "grad_norm": 1.1024702787399292,
1408
+ "learning_rate": 9.608255887414673e-06,
1409
+ "loss": 0.1509,
1410
+ "step": 1000
1411
+ },
1412
+ {
1413
+ "epoch": 3.200956937799043,
1414
+ "grad_norm": 1.252359390258789,
1415
+ "learning_rate": 9.46109189566145e-06,
1416
+ "loss": 0.1488,
1417
+ "step": 1005
1418
+ },
1419
+ {
1420
+ "epoch": 3.2169059011164274,
1421
+ "grad_norm": 0.9932821393013,
1422
+ "learning_rate": 9.314542662504316e-06,
1423
+ "loss": 0.158,
1424
+ "step": 1010
1425
+ },
1426
+ {
1427
+ "epoch": 3.2328548644338118,
1428
+ "grad_norm": 1.2389966249465942,
1429
+ "learning_rate": 9.168624453315284e-06,
1430
+ "loss": 0.148,
1431
+ "step": 1015
1432
+ },
1433
+ {
1434
+ "epoch": 3.248803827751196,
1435
+ "grad_norm": 1.2603164911270142,
1436
+ "learning_rate": 9.023353463429556e-06,
1437
+ "loss": 0.1614,
1438
+ "step": 1020
1439
+ },
1440
+ {
1441
+ "epoch": 3.2647527910685805,
1442
+ "grad_norm": 0.96819669008255,
1443
+ "learning_rate": 8.878745816348025e-06,
1444
+ "loss": 0.1436,
1445
+ "step": 1025
1446
+ },
1447
+ {
1448
+ "epoch": 3.280701754385965,
1449
+ "grad_norm": 0.9492572546005249,
1450
+ "learning_rate": 8.734817561947759e-06,
1451
+ "loss": 0.151,
1452
+ "step": 1030
1453
+ },
1454
+ {
1455
+ "epoch": 3.2966507177033493,
1456
+ "grad_norm": 1.0807409286499023,
1457
+ "learning_rate": 8.591584674700613e-06,
1458
+ "loss": 0.1484,
1459
+ "step": 1035
1460
+ },
1461
+ {
1462
+ "epoch": 3.3125996810207337,
1463
+ "grad_norm": 1.1965206861495972,
1464
+ "learning_rate": 8.449063051900233e-06,
1465
+ "loss": 0.1518,
1466
+ "step": 1040
1467
+ },
1468
+ {
1469
+ "epoch": 3.328548644338118,
1470
+ "grad_norm": 1.0824549198150635,
1471
+ "learning_rate": 8.307268511897667e-06,
1472
+ "loss": 0.1423,
1473
+ "step": 1045
1474
+ },
1475
+ {
1476
+ "epoch": 3.3444976076555024,
1477
+ "grad_norm": 1.178969383239746,
1478
+ "learning_rate": 8.166216792345648e-06,
1479
+ "loss": 0.1415,
1480
+ "step": 1050
1481
+ },
1482
+ {
1483
+ "epoch": 3.360446570972887,
1484
+ "grad_norm": 1.2653928995132446,
1485
+ "learning_rate": 8.02592354845194e-06,
1486
+ "loss": 0.1392,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 3.376395534290271,
1491
+ "grad_norm": 1.0409033298492432,
1492
+ "learning_rate": 7.886404351241731e-06,
1493
+ "loss": 0.1189,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 3.3923444976076556,
1498
+ "grad_norm": 1.204380989074707,
1499
+ "learning_rate": 7.747674685829451e-06,
1500
+ "loss": 0.1465,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 3.40829346092504,
1505
+ "grad_norm": 1.1518088579177856,
1506
+ "learning_rate": 7.609749949700084e-06,
1507
+ "loss": 0.1081,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 3.4242424242424243,
1512
+ "grad_norm": 1.0367203950881958,
1513
+ "learning_rate": 7.472645451000214e-06,
1514
+ "loss": 0.1154,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 3.4401913875598087,
1519
+ "grad_norm": 1.1489876508712769,
1520
+ "learning_rate": 7.3363764068389674e-06,
1521
+ "loss": 0.1339,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 3.456140350877193,
1526
+ "grad_norm": 1.1962957382202148,
1527
+ "learning_rate": 7.200957941599126e-06,
1528
+ "loss": 0.1519,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 3.4720893141945774,
1533
+ "grad_norm": 1.111609697341919,
1534
+ "learning_rate": 7.066405085258427e-06,
1535
+ "loss": 0.1386,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 3.488038277511962,
1540
+ "grad_norm": 1.0738730430603027,
1541
+ "learning_rate": 6.932732771721447e-06,
1542
+ "loss": 0.1281,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 3.503987240829346,
1547
+ "grad_norm": 1.0701189041137695,
1548
+ "learning_rate": 6.799955837162082e-06,
1549
+ "loss": 0.1189,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 3.5199362041467306,
1554
+ "grad_norm": 1.2711684703826904,
1555
+ "learning_rate": 6.668089018376892e-06,
1556
+ "loss": 0.1256,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 3.535885167464115,
1561
+ "grad_norm": 0.9734954833984375,
1562
+ "learning_rate": 6.537146951149463e-06,
1563
+ "loss": 0.1268,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 3.5518341307814993,
1568
+ "grad_norm": 1.0430442094802856,
1569
+ "learning_rate": 6.407144168626038e-06,
1570
+ "loss": 0.1308,
1571
+ "step": 1115
1572
+ },
1573
+ {
1574
+ "epoch": 3.5677830940988837,
1575
+ "grad_norm": 1.088063359260559,
1576
+ "learning_rate": 6.2780950997024345e-06,
1577
+ "loss": 0.1225,
1578
+ "step": 1120
1579
+ },
1580
+ {
1581
+ "epoch": 3.583732057416268,
1582
+ "grad_norm": 1.3580750226974487,
1583
+ "learning_rate": 6.1500140674226575e-06,
1584
+ "loss": 0.1486,
1585
+ "step": 1125
1586
+ },
1587
+ {
1588
+ "epoch": 3.5996810207336525,
1589
+ "grad_norm": 1.069808006286621,
1590
+ "learning_rate": 6.02291528738914e-06,
1591
+ "loss": 0.1387,
1592
+ "step": 1130
1593
+ },
1594
+ {
1595
+ "epoch": 3.6156299840510364,
1596
+ "grad_norm": 1.0957990884780884,
1597
+ "learning_rate": 5.896812866185011e-06,
1598
+ "loss": 0.133,
1599
+ "step": 1135
1600
+ },
1601
+ {
1602
+ "epoch": 3.6315789473684212,
1603
+ "grad_norm": 1.1064181327819824,
1604
+ "learning_rate": 5.7717207998083895e-06,
1605
+ "loss": 0.138,
1606
+ "step": 1140
1607
+ },
1608
+ {
1609
+ "epoch": 3.647527910685805,
1610
+ "grad_norm": 1.1944586038589478,
1611
+ "learning_rate": 5.647652972118998e-06,
1612
+ "loss": 0.1339,
1613
+ "step": 1145
1614
+ },
1615
+ {
1616
+ "epoch": 3.66347687400319,
1617
+ "grad_norm": 1.0332287549972534,
1618
+ "learning_rate": 5.524623153297183e-06,
1619
+ "loss": 0.1287,
1620
+ "step": 1150
1621
+ },
1622
+ {
1623
+ "epoch": 3.679425837320574,
1624
+ "grad_norm": 1.0928303003311157,
1625
+ "learning_rate": 5.402644998315609e-06,
1626
+ "loss": 0.1262,
1627
+ "step": 1155
1628
+ },
1629
+ {
1630
+ "epoch": 3.6953748006379588,
1631
+ "grad_norm": 1.2543113231658936,
1632
+ "learning_rate": 5.281732045423664e-06,
1633
+ "loss": 0.1443,
1634
+ "step": 1160
1635
+ },
1636
+ {
1637
+ "epoch": 3.7113237639553427,
1638
+ "grad_norm": 0.9908517003059387,
1639
+ "learning_rate": 5.1618977146449e-06,
1640
+ "loss": 0.1091,
1641
+ "step": 1165
1642
+ },
1643
+ {
1644
+ "epoch": 3.7272727272727275,
1645
+ "grad_norm": 1.0878533124923706,
1646
+ "learning_rate": 5.04315530628752e-06,
1647
+ "loss": 0.1064,
1648
+ "step": 1170
1649
+ },
1650
+ {
1651
+ "epoch": 3.7432216905901115,
1652
+ "grad_norm": 1.1831837892532349,
1653
+ "learning_rate": 4.925517999468232e-06,
1654
+ "loss": 0.1234,
1655
+ "step": 1175
1656
+ },
1657
+ {
1658
+ "epoch": 3.7591706539074963,
1659
+ "grad_norm": 0.9875324368476868,
1660
+ "learning_rate": 4.808998850649456e-06,
1661
+ "loss": 0.1181,
1662
+ "step": 1180
1663
+ },
1664
+ {
1665
+ "epoch": 3.77511961722488,
1666
+ "grad_norm": 1.011039137840271,
1667
+ "learning_rate": 4.693610792190252e-06,
1668
+ "loss": 0.1121,
1669
+ "step": 1185
1670
+ },
1671
+ {
1672
+ "epoch": 3.7910685805422646,
1673
+ "grad_norm": 1.1317858695983887,
1674
+ "learning_rate": 4.579366630910923e-06,
1675
+ "loss": 0.116,
1676
+ "step": 1190
1677
+ },
1678
+ {
1679
+ "epoch": 3.807017543859649,
1680
+ "grad_norm": 0.919337272644043,
1681
+ "learning_rate": 4.466279046671637e-06,
1682
+ "loss": 0.0911,
1683
+ "step": 1195
1684
+ },
1685
+ {
1686
+ "epoch": 3.8229665071770333,
1687
+ "grad_norm": 1.0885950326919556,
1688
+ "learning_rate": 4.3543605909650676e-06,
1689
+ "loss": 0.1339,
1690
+ "step": 1200
1691
+ },
1692
+ {
1693
+ "epoch": 3.8389154704944177,
1694
+ "grad_norm": 0.9658224582672119,
1695
+ "learning_rate": 4.243623685523341e-06,
1696
+ "loss": 0.1289,
1697
+ "step": 1205
1698
+ },
1699
+ {
1700
+ "epoch": 3.854864433811802,
1701
+ "grad_norm": 0.9771375060081482,
1702
+ "learning_rate": 4.134080620939325e-06,
1703
+ "loss": 0.1214,
1704
+ "step": 1210
1705
+ },
1706
+ {
1707
+ "epoch": 3.8708133971291865,
1708
+ "grad_norm": 0.993411123752594,
1709
+ "learning_rate": 4.025743555302564e-06,
1710
+ "loss": 0.1117,
1711
+ "step": 1215
1712
+ },
1713
+ {
1714
+ "epoch": 3.886762360446571,
1715
+ "grad_norm": 1.0799087285995483,
1716
+ "learning_rate": 3.918624512849791e-06,
1717
+ "loss": 0.1062,
1718
+ "step": 1220
1719
+ },
1720
+ {
1721
+ "epoch": 3.9027113237639552,
1722
+ "grad_norm": 1.17735755443573,
1723
+ "learning_rate": 3.8127353826304303e-06,
1724
+ "loss": 0.1354,
1725
+ "step": 1225
1726
+ },
1727
+ {
1728
+ "epoch": 3.9186602870813396,
1729
+ "grad_norm": 0.9857722520828247,
1730
+ "learning_rate": 3.7080879171869967e-06,
1731
+ "loss": 0.1194,
1732
+ "step": 1230
1733
+ },
1734
+ {
1735
+ "epoch": 3.934609250398724,
1736
+ "grad_norm": 1.1036955118179321,
1737
+ "learning_rate": 3.6046937312507296e-06,
1738
+ "loss": 0.1201,
1739
+ "step": 1235
1740
+ },
1741
+ {
1742
+ "epoch": 3.9505582137161084,
1743
+ "grad_norm": 1.0318442583084106,
1744
+ "learning_rate": 3.5025643004524467e-06,
1745
+ "loss": 0.1114,
1746
+ "step": 1240
1747
+ },
1748
+ {
1749
+ "epoch": 3.9665071770334928,
1750
+ "grad_norm": 0.9945523738861084,
1751
+ "learning_rate": 3.4017109600489068e-06,
1752
+ "loss": 0.0964,
1753
+ "step": 1245
1754
+ },
1755
+ {
1756
+ "epoch": 3.982456140350877,
1757
+ "grad_norm": 1.079817533493042,
1758
+ "learning_rate": 3.302144903664698e-06,
1759
+ "loss": 0.1193,
1760
+ "step": 1250
1761
+ },
1762
+ {
1763
+ "epoch": 3.9984051036682615,
1764
+ "grad_norm": 1.1149638891220093,
1765
+ "learning_rate": 3.2038771820498834e-06,
1766
+ "loss": 0.1272,
1767
+ "step": 1255
1768
+ },
1769
+ {
1770
+ "epoch": 4.012759170653908,
1771
+ "grad_norm": 0.9764154553413391,
1772
+ "learning_rate": 3.1069187018534596e-06,
1773
+ "loss": 0.0872,
1774
+ "step": 1260
1775
+ },
1776
+ {
1777
+ "epoch": 4.028708133971292,
1778
+ "grad_norm": 0.7976694107055664,
1779
+ "learning_rate": 3.0112802244128757e-06,
1780
+ "loss": 0.0914,
1781
+ "step": 1265
1782
+ },
1783
+ {
1784
+ "epoch": 4.044657097288677,
1785
+ "grad_norm": 0.8424039483070374,
1786
+ "learning_rate": 2.9169723645596075e-06,
1787
+ "loss": 0.0932,
1788
+ "step": 1270
1789
+ },
1790
+ {
1791
+ "epoch": 4.0606060606060606,
1792
+ "grad_norm": 0.8919062614440918,
1793
+ "learning_rate": 2.8240055894410554e-06,
1794
+ "loss": 0.0915,
1795
+ "step": 1275
1796
+ },
1797
+ {
1798
+ "epoch": 4.076555023923445,
1799
+ "grad_norm": 0.8185088634490967,
1800
+ "learning_rate": 2.7323902173587713e-06,
1801
+ "loss": 0.0943,
1802
+ "step": 1280
1803
+ },
1804
+ {
1805
+ "epoch": 4.092503987240829,
1806
+ "grad_norm": 0.9200054407119751,
1807
+ "learning_rate": 2.6421364166232863e-06,
1808
+ "loss": 0.0967,
1809
+ "step": 1285
1810
+ },
1811
+ {
1812
+ "epoch": 4.108452950558214,
1813
+ "grad_norm": 0.9717190861701965,
1814
+ "learning_rate": 2.553254204425483e-06,
1815
+ "loss": 0.0966,
1816
+ "step": 1290
1817
+ },
1818
+ {
1819
+ "epoch": 4.124401913875598,
1820
+ "grad_norm": 0.9343546032905579,
1821
+ "learning_rate": 2.465753445724847e-06,
1822
+ "loss": 0.0972,
1823
+ "step": 1295
1824
+ },
1825
+ {
1826
+ "epoch": 4.140350877192983,
1827
+ "grad_norm": 0.9351934790611267,
1828
+ "learning_rate": 2.379643852154521e-06,
1829
+ "loss": 0.1019,
1830
+ "step": 1300
1831
+ },
1832
+ {
1833
+ "epoch": 4.156299840510367,
1834
+ "grad_norm": 0.7550760507583618,
1835
+ "learning_rate": 2.2949349809434567e-06,
1836
+ "loss": 0.0908,
1837
+ "step": 1305
1838
+ },
1839
+ {
1840
+ "epoch": 4.172248803827751,
1841
+ "grad_norm": 0.8611054420471191,
1842
+ "learning_rate": 2.211636233855636e-06,
1843
+ "loss": 0.0737,
1844
+ "step": 1310
1845
+ },
1846
+ {
1847
+ "epoch": 4.188197767145136,
1848
+ "grad_norm": 0.94122713804245,
1849
+ "learning_rate": 2.129756856146599e-06,
1850
+ "loss": 0.0964,
1851
+ "step": 1315
1852
+ },
1853
+ {
1854
+ "epoch": 4.2041467304625195,
1855
+ "grad_norm": 0.8340793251991272,
1856
+ "learning_rate": 2.049305935537301e-06,
1857
+ "loss": 0.09,
1858
+ "step": 1320
1859
+ },
1860
+ {
1861
+ "epoch": 4.220095693779904,
1862
+ "grad_norm": 0.7951894998550415,
1863
+ "learning_rate": 1.9702924012055045e-06,
1864
+ "loss": 0.0992,
1865
+ "step": 1325
1866
+ },
1867
+ {
1868
+ "epoch": 4.236044657097288,
1869
+ "grad_norm": 0.8139170408248901,
1870
+ "learning_rate": 1.8927250227946957e-06,
1871
+ "loss": 0.0909,
1872
+ "step": 1330
1873
+ },
1874
+ {
1875
+ "epoch": 4.251993620414673,
1876
+ "grad_norm": 0.9547438621520996,
1877
+ "learning_rate": 1.816612409440792e-06,
1878
+ "loss": 0.079,
1879
+ "step": 1335
1880
+ },
1881
+ {
1882
+ "epoch": 4.267942583732057,
1883
+ "grad_norm": 0.889021098613739,
1884
+ "learning_rate": 1.7419630088165832e-06,
1885
+ "loss": 0.0871,
1886
+ "step": 1340
1887
+ },
1888
+ {
1889
+ "epoch": 4.283891547049442,
1890
+ "grad_norm": 0.9018152952194214,
1891
+ "learning_rate": 1.6687851061941694e-06,
1892
+ "loss": 0.0974,
1893
+ "step": 1345
1894
+ },
1895
+ {
1896
+ "epoch": 4.299840510366826,
1897
+ "grad_norm": 0.8666483759880066,
1898
+ "learning_rate": 1.5970868235253466e-06,
1899
+ "loss": 0.1071,
1900
+ "step": 1350
1901
+ },
1902
+ {
1903
+ "epoch": 4.315789473684211,
1904
+ "grad_norm": 0.9299699068069458,
1905
+ "learning_rate": 1.526876118540193e-06,
1906
+ "loss": 0.0953,
1907
+ "step": 1355
1908
+ },
1909
+ {
1910
+ "epoch": 4.3317384370015946,
1911
+ "grad_norm": 0.785976767539978,
1912
+ "learning_rate": 1.458160783863829e-06,
1913
+ "loss": 0.0948,
1914
+ "step": 1360
1915
+ },
1916
+ {
1917
+ "epoch": 4.347687400318979,
1918
+ "grad_norm": 0.8794880509376526,
1919
+ "learning_rate": 1.3909484461515247e-06,
1920
+ "loss": 0.0839,
1921
+ "step": 1365
1922
+ },
1923
+ {
1924
+ "epoch": 4.363636363636363,
1925
+ "grad_norm": 0.8219895958900452,
1926
+ "learning_rate": 1.3252465652422157e-06,
1927
+ "loss": 0.0785,
1928
+ "step": 1370
1929
+ },
1930
+ {
1931
+ "epoch": 4.379585326953748,
1932
+ "grad_norm": 0.8761910200119019,
1933
+ "learning_rate": 1.2610624333305632e-06,
1934
+ "loss": 0.0879,
1935
+ "step": 1375
1936
+ },
1937
+ {
1938
+ "epoch": 4.395534290271132,
1939
+ "grad_norm": 0.7061748504638672,
1940
+ "learning_rate": 1.1984031741575724e-06,
1941
+ "loss": 0.0899,
1942
+ "step": 1380
1943
+ },
1944
+ {
1945
+ "epoch": 4.411483253588517,
1946
+ "grad_norm": 0.94333416223526,
1947
+ "learning_rate": 1.137275742219962e-06,
1948
+ "loss": 0.0955,
1949
+ "step": 1385
1950
+ },
1951
+ {
1952
+ "epoch": 4.427432216905901,
1953
+ "grad_norm": 0.9461633563041687,
1954
+ "learning_rate": 1.0776869219982643e-06,
1955
+ "loss": 0.0856,
1956
+ "step": 1390
1957
+ },
1958
+ {
1959
+ "epoch": 4.443381180223286,
1960
+ "grad_norm": 0.9480202794075012,
1961
+ "learning_rate": 1.0196433272038464e-06,
1962
+ "loss": 0.0917,
1963
+ "step": 1395
1964
+ },
1965
+ {
1966
+ "epoch": 4.45933014354067,
1967
+ "grad_norm": 0.932051956653595,
1968
+ "learning_rate": 9.631514000448398e-07,
1969
+ "loss": 0.0804,
1970
+ "step": 1400
1971
+ },
1972
+ {
1973
+ "epoch": 4.475279106858054,
1974
+ "grad_norm": 0.7485371232032776,
1975
+ "learning_rate": 9.082174105111423e-07,
1976
+ "loss": 0.0808,
1977
+ "step": 1405
1978
+ },
1979
+ {
1980
+ "epoch": 4.491228070175438,
1981
+ "grad_norm": 0.9261391162872314,
1982
+ "learning_rate": 8.548474556784997e-07,
1983
+ "loss": 0.0832,
1984
+ "step": 1410
1985
+ },
1986
+ {
1987
+ "epoch": 4.507177033492823,
1988
+ "grad_norm": 0.8479756712913513,
1989
+ "learning_rate": 8.030474590318109e-07,
1990
+ "loss": 0.0802,
1991
+ "step": 1415
1992
+ },
1993
+ {
1994
+ "epoch": 4.523125996810207,
1995
+ "grad_norm": 0.7261364459991455,
1996
+ "learning_rate": 7.528231698076765e-07,
1997
+ "loss": 0.0851,
1998
+ "step": 1420
1999
+ },
2000
+ {
2001
+ "epoch": 4.539074960127592,
2002
+ "grad_norm": 0.7175788283348083,
2003
+ "learning_rate": 7.041801623563077e-07,
2004
+ "loss": 0.0894,
2005
+ "step": 1425
2006
+ },
2007
+ {
2008
+ "epoch": 4.555023923444976,
2009
+ "grad_norm": 0.7924843430519104,
2010
+ "learning_rate": 6.571238355228137e-07,
2011
+ "loss": 0.0898,
2012
+ "step": 1430
2013
+ },
2014
+ {
2015
+ "epoch": 4.570972886762361,
2016
+ "grad_norm": 0.8615280985832214,
2017
+ "learning_rate": 6.116594120480179e-07,
2018
+ "loss": 0.0871,
2019
+ "step": 1435
2020
+ },
2021
+ {
2022
+ "epoch": 4.586921850079745,
2023
+ "grad_norm": 0.8503677248954773,
2024
+ "learning_rate": 5.677919379887575e-07,
2025
+ "loss": 0.0898,
2026
+ "step": 1440
2027
+ },
2028
+ {
2029
+ "epoch": 4.6028708133971294,
2030
+ "grad_norm": 0.8476049304008484,
2031
+ "learning_rate": 5.255262821578521e-07,
2032
+ "loss": 0.0775,
2033
+ "step": 1445
2034
+ },
2035
+ {
2036
+ "epoch": 4.618819776714513,
2037
+ "grad_norm": 0.8376314043998718,
2038
+ "learning_rate": 4.848671355837026e-07,
2039
+ "loss": 0.0884,
2040
+ "step": 1450
2041
+ },
2042
+ {
2043
+ "epoch": 4.634768740031898,
2044
+ "grad_norm": 0.9263095259666443,
2045
+ "learning_rate": 4.4581901098964874e-07,
2046
+ "loss": 0.0927,
2047
+ "step": 1455
2048
+ },
2049
+ {
2050
+ "epoch": 4.650717703349282,
2051
+ "grad_norm": 0.8247036933898926,
2052
+ "learning_rate": 4.0838624229309563e-07,
2053
+ "loss": 0.0867,
2054
+ "step": 1460
2055
+ },
2056
+ {
2057
+ "epoch": 4.666666666666667,
2058
+ "grad_norm": 0.9262139201164246,
2059
+ "learning_rate": 3.7257298412450625e-07,
2060
+ "loss": 0.0893,
2061
+ "step": 1465
2062
+ },
2063
+ {
2064
+ "epoch": 4.682615629984051,
2065
+ "grad_norm": 0.8191598057746887,
2066
+ "learning_rate": 3.3838321136627715e-07,
2067
+ "loss": 0.0804,
2068
+ "step": 1470
2069
+ },
2070
+ {
2071
+ "epoch": 4.698564593301436,
2072
+ "grad_norm": 0.8580271601676941,
2073
+ "learning_rate": 3.05820718711568e-07,
2074
+ "loss": 0.0758,
2075
+ "step": 1475
2076
+ },
2077
+ {
2078
+ "epoch": 4.71451355661882,
2079
+ "grad_norm": 0.6885163187980652,
2080
+ "learning_rate": 2.748891202431353e-07,
2081
+ "loss": 0.0814,
2082
+ "step": 1480
2083
+ },
2084
+ {
2085
+ "epoch": 4.7304625199362045,
2086
+ "grad_norm": 0.9642685651779175,
2087
+ "learning_rate": 2.455918490322062e-07,
2088
+ "loss": 0.0992,
2089
+ "step": 1485
2090
+ },
2091
+ {
2092
+ "epoch": 4.746411483253588,
2093
+ "grad_norm": 0.8175750970840454,
2094
+ "learning_rate": 2.1793215675744882e-07,
2095
+ "loss": 0.0728,
2096
+ "step": 1490
2097
+ },
2098
+ {
2099
+ "epoch": 4.762360446570973,
2100
+ "grad_norm": 0.8122966289520264,
2101
+ "learning_rate": 1.9191311334406713e-07,
2102
+ "loss": 0.0997,
2103
+ "step": 1495
2104
+ },
2105
+ {
2106
+ "epoch": 4.778309409888357,
2107
+ "grad_norm": 0.9818465113639832,
2108
+ "learning_rate": 1.6753760662307217e-07,
2109
+ "loss": 0.0957,
2110
+ "step": 1500
2111
+ },
2112
+ {
2113
+ "epoch": 4.794258373205742,
2114
+ "grad_norm": 0.8007465600967407,
2115
+ "learning_rate": 1.4480834201076987e-07,
2116
+ "loss": 0.0769,
2117
+ "step": 1505
2118
+ },
2119
+ {
2120
+ "epoch": 4.810207336523126,
2121
+ "grad_norm": 1.1908260583877563,
2122
+ "learning_rate": 1.2372784220847976e-07,
2123
+ "loss": 0.0915,
2124
+ "step": 1510
2125
+ },
2126
+ {
2127
+ "epoch": 4.826156299840511,
2128
+ "grad_norm": 1.0244992971420288,
2129
+ "learning_rate": 1.0429844692255719e-07,
2130
+ "loss": 0.0818,
2131
+ "step": 1515
2132
+ },
2133
+ {
2134
+ "epoch": 4.842105263157895,
2135
+ "grad_norm": 0.7050264477729797,
2136
+ "learning_rate": 8.652231260469267e-08,
2137
+ "loss": 0.0739,
2138
+ "step": 1520
2139
+ },
2140
+ {
2141
+ "epoch": 4.858054226475279,
2142
+ "grad_norm": 0.8911679983139038,
2143
+ "learning_rate": 7.040141221258511e-08,
2144
+ "loss": 0.0857,
2145
+ "step": 1525
2146
+ },
2147
+ {
2148
+ "epoch": 4.8740031897926634,
2149
+ "grad_norm": 0.7882342338562012,
2150
+ "learning_rate": 5.593753499095522e-08,
2151
+ "loss": 0.0833,
2152
+ "step": 1530
2153
+ },
2154
+ {
2155
+ "epoch": 4.889952153110048,
2156
+ "grad_norm": 0.912810742855072,
2157
+ "learning_rate": 4.313228627296606e-08,
2158
+ "loss": 0.0933,
2159
+ "step": 1535
2160
+ },
2161
+ {
2162
+ "epoch": 4.905901116427432,
2163
+ "grad_norm": 0.8646571040153503,
2164
+ "learning_rate": 3.1987087302040585e-08,
2165
+ "loss": 0.0805,
2166
+ "step": 1540
2167
+ },
2168
+ {
2169
+ "epoch": 4.921850079744816,
2170
+ "grad_norm": 0.7618115544319153,
2171
+ "learning_rate": 2.250317507412447e-08,
2172
+ "loss": 0.0847,
2173
+ "step": 1545
2174
+ },
2175
+ {
2176
+ "epoch": 4.937799043062201,
2177
+ "grad_norm": 0.7340676784515381,
2178
+ "learning_rate": 1.4681602200395938e-08,
2179
+ "loss": 0.095,
2180
+ "step": 1550
2181
+ },
2182
+ {
2183
+ "epoch": 4.953748006379586,
2184
+ "grad_norm": 0.776946485042572,
2185
+ "learning_rate": 8.523236790427547e-09,
2186
+ "loss": 0.0868,
2187
+ "step": 1555
2188
+ },
2189
+ {
2190
+ "epoch": 4.96969696969697,
2191
+ "grad_norm": 0.7228500247001648,
2192
+ "learning_rate": 4.028762355841598e-09,
2193
+ "loss": 0.0792,
2194
+ "step": 1560
2195
+ },
2196
+ {
2197
+ "epoch": 4.985645933014354,
2198
+ "grad_norm": 0.8218967318534851,
2199
+ "learning_rate": 1.1986777344524802e-09,
2200
+ "loss": 0.0973,
2201
+ "step": 1565
2202
+ },
2203
+ {
2204
+ "epoch": 5.0,
2205
+ "grad_norm": 1.0494500398635864,
2206
+ "learning_rate": 3.329703489096669e-11,
2207
+ "loss": 0.0834,
2208
+ "step": 1570
2209
+ }
2210
+ ],
2211
+ "logging_steps": 5,
2212
+ "max_steps": 1570,
2213
+ "num_input_tokens_seen": 0,
2214
+ "num_train_epochs": 5,
2215
+ "save_steps": 2000,
2216
+ "stateful_callbacks": {
2217
+ "TrainerControl": {
2218
+ "args": {
2219
+ "should_epoch_stop": false,
2220
+ "should_evaluate": false,
2221
+ "should_log": false,
2222
+ "should_save": true,
2223
+ "should_training_stop": true
2224
+ },
2225
+ "attributes": {}
2226
+ }
2227
+ },
2228
+ "total_flos": 2.137548515289596e+18,
2229
+ "train_batch_size": 2,
2230
+ "trial_name": null,
2231
+ "trial_params": null
2232
+ }
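The log above ends at step 1570 (epoch 5.0) with `should_training_stop` set. A minimal sketch for reading the logged loss and learning-rate curve back out of this file, assuming the entries shown above sit under the Trainer's usual `log_history` key and that the file lives at the path used in this commit:

```python
# Editorial sketch, not part of the upload: print step, loss, and learning rate
# from this checkpoint's trainer_state.json. The "log_history" key and the file
# path are assumptions based on the standard transformers Trainer layout.
import json

with open("13_128_e5_3e-5/checkpoint-1570/trainer_state.json") as f:
    state = json.load(f)

for entry in state["log_history"]:
    if "loss" in entry:  # skip entries without a training loss
        print(entry["step"], entry["loss"], entry.get("learning_rate"))
```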
13_128_e5_3e-5/checkpoint-1570/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:91232031982acf747326dd0572af7262d829f9122bf0cae34fb47e91305f6c59
3
+ size 7736
13_128_e5_3e-5/checkpoint-1570/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-1570/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
25
+ # DeepSpeed data structures it has to be available in the current python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict()
35
+ param_shapes: dict()
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict()
39
+ frozen_param_fragments: dict()
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
62
+ if not os.path.isdir(checkpoint_dir):
63
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
64
+
65
+ # there should be only one file
66
+ if zero_stage <= 2:
67
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
68
+ elif zero_stage == 3:
69
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
70
+
71
+ if not os.path.exists(file):
72
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
73
+
74
+ return file
75
+
76
+
77
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
78
+ # XXX: need to test that this simple glob rule works for multi-node setup too
79
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
80
+
81
+ if len(ckpt_files) == 0:
82
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
83
+
84
+ return ckpt_files
85
+
86
+
87
+ def get_optim_files(checkpoint_dir):
88
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
89
+
90
+
91
+ def get_model_state_files(checkpoint_dir):
92
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
93
+
94
+
95
+ def parse_model_states(files):
96
+ zero_model_states = []
97
+ for file in files:
98
+ state_dict = torch.load(file, map_location=device)
99
+
100
+ if BUFFER_NAMES not in state_dict:
101
+ raise ValueError(f"{file} is not a model state checkpoint")
102
+ buffer_names = state_dict[BUFFER_NAMES]
103
+ if debug:
104
+ print("Found buffers:", buffer_names)
105
+
106
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
107
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
108
+ param_shapes = state_dict[PARAM_SHAPES]
109
+
110
+ # collect parameters that are included in param_shapes
111
+ param_names = []
112
+ for s in param_shapes:
113
+ for name in s.keys():
114
+ param_names.append(name)
115
+
116
+ # update with frozen parameters
117
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
118
+ if frozen_param_shapes is not None:
119
+ if debug:
120
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
121
+ param_names += list(frozen_param_shapes.keys())
122
+
123
+ # handle shared params
124
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
125
+
126
+ ds_version = state_dict.get(DS_VERSION, None)
127
+
128
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
129
+
130
+ z_model_state = zero_model_state(buffers=buffers,
131
+ param_shapes=param_shapes,
132
+ shared_params=shared_params,
133
+ ds_version=ds_version,
134
+ frozen_param_shapes=frozen_param_shapes,
135
+ frozen_param_fragments=frozen_param_fragments)
136
+ zero_model_states.append(z_model_state)
137
+
138
+ return zero_model_states
139
+
140
+
141
+ def parse_optim_states(files, ds_checkpoint_dir):
142
+
143
+ total_files = len(files)
144
+ state_dicts = []
145
+ for f in files:
146
+ state_dict = torch.load(f, map_location=device)
147
+ # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
148
+ # and also handle the case where it was already removed by another helper script
149
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
150
+ state_dicts.append(state_dict)
151
+
152
+ if not ZERO_STAGE in state_dicts[0][OPTIMIZER_STATE_DICT]:
153
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
154
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
155
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
156
+
157
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
158
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
159
+ # use the max of the partition_count to get the dp world_size.
160
+
161
+ if type(world_size) is list:
162
+ world_size = max(world_size)
163
+
164
+ if world_size != total_files:
165
+ raise ValueError(
166
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
167
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
168
+ )
169
+
170
+ # the groups are named differently in each stage
171
+ if zero_stage <= 2:
172
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
173
+ elif zero_stage == 3:
174
+ fp32_groups_key = FP32_FLAT_GROUPS
175
+ else:
176
+ raise ValueError(f"unknown zero stage {zero_stage}")
177
+
178
+ if zero_stage <= 2:
179
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
180
+ elif zero_stage == 3:
181
+ # if there is more than one param group, there will be multiple flattened tensors - one
182
+ # flattened tensor per group - for simplicity merge them into a single tensor
183
+ #
184
+ # XXX: could make the script more memory efficient for when there are multiple groups - it
185
+ # will require matching the sub-lists of param_shapes for each param group flattened tensor
186
+
187
+ fp32_flat_groups = [
188
+ torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
189
+ ]
190
+
191
+ return zero_stage, world_size, fp32_flat_groups
192
+
193
+
194
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
195
+ """
196
+ Returns fp32 state_dict reconstructed from ds checkpoint
197
+
198
+ Args:
199
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
200
+
201
+ """
202
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
203
+
204
+ optim_files = get_optim_files(ds_checkpoint_dir)
205
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
206
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
207
+
208
+ model_files = get_model_state_files(ds_checkpoint_dir)
209
+
210
+ zero_model_states = parse_model_states(model_files)
211
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
212
+
213
+ if zero_stage <= 2:
214
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
215
+ exclude_frozen_parameters)
216
+ elif zero_stage == 3:
217
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
218
+ exclude_frozen_parameters)
219
+
220
+
221
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
222
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
223
+ return
224
+
225
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
226
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
227
+
228
+ if debug:
229
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
230
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
231
+
232
+ wanted_params = len(frozen_param_shapes)
233
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
234
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
235
+ print(f'Frozen params: Have {avail_numel} numels to process.')
236
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
237
+
238
+ total_params = 0
239
+ total_numel = 0
240
+ for name, shape in frozen_param_shapes.items():
241
+ total_params += 1
242
+ unpartitioned_numel = shape.numel()
243
+ total_numel += unpartitioned_numel
244
+
245
+ state_dict[name] = frozen_param_fragments[name]
246
+
247
+ if debug:
248
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
249
+
250
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
251
+
252
+
253
+ def _has_callable(obj, fn):
254
+ attr = getattr(obj, fn, None)
255
+ return callable(attr)
256
+
257
+
258
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
259
+ param_shapes = zero_model_states[0].param_shapes
260
+
261
+ # Reconstruction protocol:
262
+ #
263
+ # XXX: document this
264
+
265
+ if debug:
266
+ for i in range(world_size):
267
+ for j in range(len(fp32_flat_groups[0])):
268
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
269
+
270
+ # XXX: memory usage doubles here (zero2)
271
+ num_param_groups = len(fp32_flat_groups[0])
272
+ merged_single_partition_of_fp32_groups = []
273
+ for i in range(num_param_groups):
274
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
275
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
276
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
277
+ avail_numel = sum(
278
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
279
+
280
+ if debug:
281
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
282
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
283
+ # not asserting if there is a mismatch due to possible padding
284
+ print(f"Have {avail_numel} numels to process.")
285
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
286
+
287
+ # params
288
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
289
+ # out-of-core computing solution
290
+ total_numel = 0
291
+ total_params = 0
292
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
293
+ offset = 0
294
+ avail_numel = full_single_fp32_vector.numel()
295
+ for name, shape in shapes.items():
296
+
297
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
298
+ total_numel += unpartitioned_numel
299
+ total_params += 1
300
+
301
+ if debug:
302
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
303
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
304
+ offset += unpartitioned_numel
305
+
306
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
307
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
308
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
309
+ # live optimizer object, so we are checking that the numbers are within the right range
310
+ align_to = 2 * world_size
311
+
312
+ def zero2_align(x):
313
+ return align_to * math.ceil(x / align_to)
314
+
315
+ if debug:
316
+ print(f"original offset={offset}, avail_numel={avail_numel}")
317
+
318
+ offset = zero2_align(offset)
319
+ avail_numel = zero2_align(avail_numel)
320
+
321
+ if debug:
322
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
323
+
324
+ # Sanity check
325
+ if offset != avail_numel:
326
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
327
+
328
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
329
+
330
+
331
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
332
+ exclude_frozen_parameters):
333
+ state_dict = OrderedDict()
334
+
335
+ # buffers
336
+ buffers = zero_model_states[0].buffers
337
+ state_dict.update(buffers)
338
+ if debug:
339
+ print(f"added {len(buffers)} buffers")
340
+
341
+ if not exclude_frozen_parameters:
342
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
343
+
344
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
345
+
346
+ # recover shared parameters
347
+ for pair in zero_model_states[0].shared_params:
348
+ if pair[1] in state_dict:
349
+ state_dict[pair[0]] = state_dict[pair[1]]
350
+
351
+ return state_dict
352
+
353
+
354
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
355
+ remainder = unpartitioned_numel % world_size
356
+ padding_numel = (world_size - remainder) if remainder else 0
357
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
358
+ return partitioned_numel, padding_numel
359
+
360
+
361
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
362
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
363
+ return
364
+
365
+ if debug:
366
+ for i in range(world_size):
367
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
368
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
369
+
370
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
371
+ wanted_params = len(frozen_param_shapes)
372
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
373
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
374
+ print(f'Frozen params: Have {avail_numel} numels to process.')
375
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
376
+
377
+ total_params = 0
378
+ total_numel = 0
379
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
380
+ total_params += 1
381
+ unpartitioned_numel = shape.numel()
382
+ total_numel += unpartitioned_numel
383
+
384
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
385
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
386
+
387
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
388
+
389
+ if debug:
390
+ print(
391
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
392
+ )
393
+
394
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
395
+
396
+
397
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
398
+ param_shapes = zero_model_states[0].param_shapes
399
+ avail_numel = fp32_flat_groups[0].numel() * world_size
400
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
401
+ # param, re-consolidating each param, while dealing with padding if any
402
+
403
+ # merge list of dicts, preserving order
404
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
405
+
406
+ if debug:
407
+ for i in range(world_size):
408
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
409
+
410
+ wanted_params = len(param_shapes)
411
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
412
+ # not asserting if there is a mismatch due to possible padding
413
+ avail_numel = fp32_flat_groups[0].numel() * world_size
414
+ print(f"Trainable params: Have {avail_numel} numels to process.")
415
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
416
+
417
+ # params
418
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
419
+ # out-of-core computing solution
420
+ offset = 0
421
+ total_numel = 0
422
+ total_params = 0
423
+ for name, shape in param_shapes.items():
424
+
425
+ unpartitioned_numel = shape.numel()
426
+ total_numel += unpartitioned_numel
427
+ total_params += 1
428
+
429
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
430
+
431
+ if debug:
432
+ print(
433
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
434
+ )
435
+
436
+ # XXX: memory usage doubles here
437
+ state_dict[name] = torch.cat(
438
+ tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
439
+ 0).narrow(0, 0, unpartitioned_numel).view(shape)
440
+ offset += partitioned_numel
441
+
442
+ offset *= world_size
443
+
444
+ # Sanity check
445
+ if offset != avail_numel:
446
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
447
+
448
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
449
+
450
+
451
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
452
+ exclude_frozen_parameters):
453
+ state_dict = OrderedDict()
454
+
455
+ # buffers
456
+ buffers = zero_model_states[0].buffers
457
+ state_dict.update(buffers)
458
+ if debug:
459
+ print(f"added {len(buffers)} buffers")
460
+
461
+ if not exclude_frozen_parameters:
462
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
463
+
464
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
465
+
466
+ # recover shared parameters
467
+ for pair in zero_model_states[0].shared_params:
468
+ if pair[1] in state_dict:
469
+ state_dict[pair[0]] = state_dict[pair[1]]
470
+
471
+ return state_dict
472
+
473
+
474
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
475
+ """
476
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
477
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
478
+ via a model hub.
479
+
480
+ Args:
481
+ - ``checkpoint_dir``: path to the desired checkpoint folder
482
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
483
+ - ``exclude_frozen_parameters``: exclude frozen parameters
484
+
485
+ Returns:
486
+ - pytorch ``state_dict``
487
+
488
+ Note: this approach may not work if your application doesn't have sufficient free CPU memory and
489
+ you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
490
+ the checkpoint.
491
+
492
+ A typical usage might be ::
493
+
494
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
495
+ # do the training and checkpoint saving
496
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
497
+ model = model.cpu() # move to cpu
498
+ model.load_state_dict(state_dict)
499
+ # submit to model hub or save the model to share with others
500
+
501
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
502
+ application. i.e. you will need to re-initialize the deepspeed engine, since
503
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
504
+
505
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
506
+
507
+ """
508
+ if tag is None:
509
+ latest_path = os.path.join(checkpoint_dir, 'latest')
510
+ if os.path.isfile(latest_path):
511
+ with open(latest_path, 'r') as fd:
512
+ tag = fd.read().strip()
513
+ else:
514
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
515
+
516
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
517
+
518
+ if not os.path.isdir(ds_checkpoint_dir):
519
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
520
+
521
+ return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
522
+
523
+
524
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
525
+ """
526
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
527
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
528
+
529
+ Args:
530
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
531
+ - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
532
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
533
+ - ``exclude_frozen_parameters``: exclude frozen parameters
534
+ """
535
+
536
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
537
+ print(f"Saving fp32 state dict to {output_file}")
538
+ torch.save(state_dict, output_file)
539
+
540
+
541
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
542
+ """
543
+ 1. Put the provided model to cpu
544
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
545
+ 3. Load it into the provided model
546
+
547
+ Args:
548
+ - ``model``: the model object to update
549
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
550
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
551
+
552
+ Returns:
553
+ - ``model``: modified model
554
+
555
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
556
+ have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
557
+ conveniently placed for you in the checkpoint folder.
558
+
559
+ A typical usage might be ::
560
+
561
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
562
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
563
+ # submit to model hub or save the model to share with others
564
+
565
+ Note that once this is run, the ``model`` will no longer be usable in the deepspeed context
566
+ of the same application. i.e. you will need to re-initialize the deepspeed engine, since
567
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
568
+
569
+ """
570
+ logger.info(f"Extracting fp32 weights")
571
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
572
+
573
+ logger.info(f"Overwriting model with fp32 weights")
574
+ model = model.cpu()
575
+ model.load_state_dict(state_dict, strict=False)
576
+
577
+ return model
578
+
579
+
580
+ if __name__ == "__main__":
581
+
582
+ parser = argparse.ArgumentParser()
583
+ parser.add_argument("checkpoint_dir",
584
+ type=str,
585
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
586
+ parser.add_argument(
587
+ "output_file",
588
+ type=str,
589
+ help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
590
+ parser.add_argument("-t",
591
+ "--tag",
592
+ type=str,
593
+ default=None,
594
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
595
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
596
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
597
+ args = parser.parse_args()
598
+
599
+ debug = args.debug
600
+
601
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
602
+ args.output_file,
603
+ tag=args.tag,
604
+ exclude_frozen_parameters=args.exclude_frozen_parameters)
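Besides the CLI shown in its header comment, the script exposes the conversion as a Python function. A minimal sketch of consolidating this checkpoint's partitioned ZeRO state into a single fp32 `state_dict` file (the output filename is an assumption; the checkpoint folder must contain the `latest` tag file and the matching `global_step*` directory with the sharded optimizer states):

```python
# Editorial sketch: run the conversion defined in zero_to_fp32.py above.
# Run from the checkpoint folder (or put zero_to_fp32.py on PYTHONPATH).
# Equivalent CLI: python zero_to_fp32.py <checkpoint_dir> <output_file>
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "13_128_e5_3e-5/checkpoint-1570",                         # folder with 'latest' and global_step*/
    "13_128_e5_3e-5/checkpoint-1570/pytorch_model_fp32.bin",  # consolidated fp32 output (assumed name)
)
```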
13_128_e5_3e-5/checkpoint-314/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: ibm-granite/granite-3.3-8b-base
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
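A minimal sketch of loading one of the LoRA adapters in this repository on top of the base model with `transformers` and `peft` (the adapter path is a placeholder for whichever checkpoint folder you want; `device_map="auto"` assumes `accelerate` is installed):

```python
# Editorial sketch: attach the LoRA adapter to the base model and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ibm-granite/granite-3.3-8b-base"
adapter_path = "13_128_e5_3e-5/checkpoint-1570"  # placeholder: any checkpoint folder from this repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_path)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```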
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.15.2
13_128_e5_3e-5/checkpoint-314/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 256,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 128,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "up_proj",
28
+ "down_proj",
29
+ "q_proj",
30
+ "v_proj",
31
+ "o_proj",
32
+ "gate_proj",
33
+ "k_proj"
34
+ ],
35
+ "task_type": "CAUSAL_LM",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
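For reference, the same adapter settings restated as a `peft.LoraConfig` (a sketch only; every value is copied from the JSON above, nothing is newly chosen):

```python
# Sketch: the adapter_config.json above expressed as a peft.LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["up_proj", "down_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "k_proj"],
)
```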
13_128_e5_3e-5/checkpoint-314/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f6570c667e958b003a7ec5acc8821a5c277575e9abeeb1fca806edb72be2944
3
+ size 791751704
13_128_e5_3e-5/checkpoint-314/latest ADDED
@@ -0,0 +1 @@
1
+ global_step314
13_128_e5_3e-5/checkpoint-314/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
13_128_e5_3e-5/checkpoint-314/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2dd00a08621c94044ddf114171204ea00af5e9fbcb7603dd2b967e9660b9a535
3
+ size 15920
13_128_e5_3e-5/checkpoint-314/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b485b2ceb684ac8f592b7e2f9adcb93ddf9d41761ecc8a25e6ce9bc0970c13f1
3
+ size 15920
13_128_e5_3e-5/checkpoint-314/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:76eb3902ea153254b65a0ca7cc1df81816ab8ad33bfeb8177041caa11c5a5ce5
3
+ size 15920