RayDu0010 committed on
Commit c3ce296 · verified · 1 Parent(s): b1ba149

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. 15_128_e5_3e-5/checkpoint-1119/README.md +202 -0
  2. 15_128_e5_3e-5/checkpoint-1119/adapter_config.json +39 -0
  3. 15_128_e5_3e-5/checkpoint-1119/adapter_model.safetensors +3 -0
  4. 15_128_e5_3e-5/checkpoint-1119/latest +1 -0
  5. 15_128_e5_3e-5/checkpoint-1119/merges.txt +0 -0
  6. 15_128_e5_3e-5/checkpoint-1119/rng_state_0.pth +3 -0
  7. 15_128_e5_3e-5/checkpoint-1119/rng_state_1.pth +3 -0
  8. 15_128_e5_3e-5/checkpoint-1119/rng_state_2.pth +3 -0
  9. 15_128_e5_3e-5/checkpoint-1119/rng_state_3.pth +3 -0
  10. 15_128_e5_3e-5/checkpoint-1119/rng_state_4.pth +3 -0
  11. 15_128_e5_3e-5/checkpoint-1119/rng_state_5.pth +3 -0
  12. 15_128_e5_3e-5/checkpoint-1119/rng_state_6.pth +3 -0
  13. 15_128_e5_3e-5/checkpoint-1119/rng_state_7.pth +3 -0
  14. 15_128_e5_3e-5/checkpoint-1119/scheduler.pt +3 -0
  15. 15_128_e5_3e-5/checkpoint-1119/special_tokens_map.json +45 -0
  16. 15_128_e5_3e-5/checkpoint-1119/tokenizer.json +0 -0
  17. 15_128_e5_3e-5/checkpoint-1119/tokenizer_config.json +188 -0
  18. 15_128_e5_3e-5/checkpoint-1119/trainer_state.json +1595 -0
  19. 15_128_e5_3e-5/checkpoint-1119/training_args.bin +3 -0
  20. 15_128_e5_3e-5/checkpoint-1119/vocab.json +0 -0
  21. 15_128_e5_3e-5/checkpoint-1119/zero_to_fp32.py +604 -0
  22. 15_128_e5_3e-5/checkpoint-1492/README.md +202 -0
  23. 15_128_e5_3e-5/checkpoint-1492/adapter_config.json +39 -0
  24. 15_128_e5_3e-5/checkpoint-1492/adapter_model.safetensors +3 -0
  25. 15_128_e5_3e-5/checkpoint-1492/latest +1 -0
  26. 15_128_e5_3e-5/checkpoint-1492/merges.txt +0 -0
  27. 15_128_e5_3e-5/checkpoint-1492/rng_state_0.pth +3 -0
  28. 15_128_e5_3e-5/checkpoint-1492/rng_state_1.pth +3 -0
  29. 15_128_e5_3e-5/checkpoint-1492/rng_state_2.pth +3 -0
  30. 15_128_e5_3e-5/checkpoint-1492/rng_state_3.pth +3 -0
  31. 15_128_e5_3e-5/checkpoint-1492/rng_state_4.pth +3 -0
  32. 15_128_e5_3e-5/checkpoint-1492/rng_state_5.pth +3 -0
  33. 15_128_e5_3e-5/checkpoint-1492/rng_state_6.pth +3 -0
  34. 15_128_e5_3e-5/checkpoint-1492/rng_state_7.pth +3 -0
  35. 15_128_e5_3e-5/checkpoint-1492/scheduler.pt +3 -0
  36. 15_128_e5_3e-5/checkpoint-1492/special_tokens_map.json +45 -0
  37. 15_128_e5_3e-5/checkpoint-1492/tokenizer.json +0 -0
  38. 15_128_e5_3e-5/checkpoint-1492/tokenizer_config.json +188 -0
  39. 15_128_e5_3e-5/checkpoint-1492/trainer_state.json +2120 -0
  40. 15_128_e5_3e-5/checkpoint-1492/training_args.bin +3 -0
  41. 15_128_e5_3e-5/checkpoint-1492/vocab.json +0 -0
  42. 15_128_e5_3e-5/checkpoint-1492/zero_to_fp32.py +604 -0
  43. 15_128_e5_3e-5/checkpoint-1865/README.md +202 -0
  44. 15_128_e5_3e-5/checkpoint-1865/adapter_config.json +39 -0
  45. 15_128_e5_3e-5/checkpoint-1865/adapter_model.safetensors +3 -0
  46. 15_128_e5_3e-5/checkpoint-1865/latest +1 -0
  47. 15_128_e5_3e-5/checkpoint-1865/merges.txt +0 -0
  48. 15_128_e5_3e-5/checkpoint-1865/rng_state_0.pth +3 -0
  49. 15_128_e5_3e-5/checkpoint-1865/rng_state_1.pth +3 -0
  50. 15_128_e5_3e-5/checkpoint-1865/rng_state_2.pth +3 -0
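The commit message above, "Upload folder using huggingface_hub", corresponds to a single `upload_folder` call against the Hub API. A minimal sketch of the kind of call that produces a commit like this one, assuming a hypothetical `repo_id` (the target repo name is not shown in this view):

```python
# Sketch only: the repo_id is a placeholder; only the commit message matches above.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` by default
api.upload_folder(
    folder_path=".",                        # local directory containing 15_128_e5_3e-5/
    repo_id="RayDu0010/some-repo",          # hypothetical
    repo_type="model",
    commit_message="Upload folder using huggingface_hub",
)
```

Files above the Hub's size threshold (the `.safetensors` and `.pth` files below) are stored as Git LFS objects, which is why their diffs show only an LFS pointer.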
15_128_e5_3e-5/checkpoint-1119/README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: ibm-granite/granite-3.3-8b-base
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.2
15_128_e5_3e-5/checkpoint-1119/adapter_config.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": true,
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 256,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 128,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "k_proj",
+ "gate_proj",
+ "down_proj",
+ "up_proj",
+ "q_proj",
+ "v_proj",
+ "o_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
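This adapter_config.json describes a LoRA adapter (r=128, lora_alpha=256, dropout 0.05) over all attention and MLP projections of `ibm-granite/granite-3.3-8b-base`. A minimal sketch of attaching such a checkpoint with PEFT; the local path is a placeholder, not something recorded in this commit:

```python
# Sketch: attach the checkpoint's LoRA adapter to the base model named in
# adapter_config.json. The checkpoint path is a placeholder.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-base",  # base_model_name_or_path above
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "15_128_e5_3e-5/checkpoint-1119")
model.eval()  # the config is saved with inference_mode: true
```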
15_128_e5_3e-5/checkpoint-1119/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9de43794f29b87a014601ba52e5e9845cc10cd18f944429a86bdb71609fc8f48
+ size 791751704
15_128_e5_3e-5/checkpoint-1119/latest ADDED
@@ -0,0 +1 @@
+ global_step1119
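The `latest` file is DeepSpeed's checkpoint tag: it names the `global_step1119` subdirectory that holds the ZeRO-sharded optimizer and parameter states, and the `zero_to_fp32.py` script saved alongside it consolidates those shards. A sketch, assuming the standard helper that recent versions of that bundled script expose:

```python
# Sketch: consolidate the ZeRO shards under global_step1119 into a single
# fp32 state dict, using the zero_to_fp32.py shipped inside the checkpoint
# directory. Run from inside checkpoint-1119; the tag defaults to whatever
# the `latest` file points at.
from zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(".")
print(sum(t.numel() for t in state_dict.values()), "fp32 parameters recovered")
```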
15_128_e5_3e-5/checkpoint-1119/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
15_128_e5_3e-5/checkpoint-1119/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b04123748aadd764e04cac45b677e42d554e4dde2834ec68f3bcce3b2e0ec820
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d1d296a2268ab42ce6785f816903d27116410eda86ce318e3108ff6876e0d6c5
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c5b32468186235a457c3219a1c1f59e987397c4242b8c64d90a16ecdfc9c184
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb30564649a52968def973e531dd7d802a0bd2ff735aa0879572b6dacfb1166b
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4055acb3e7c1737e950b7582dd27cbde4aa36b1ce289940ef4ca904ec258c514
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c57d87283b66501d5d1aa8f268d67d2bdac43c56b65907365f994d9e3fa51d82
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f41f6825abc923397e019f1bd10be541a3965317df43ab4d161677798039c13d
+ size 15920
15_128_e5_3e-5/checkpoint-1119/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9ab24d08d7025db18d8a7c1f731179163d81484e9e047fcb9a67201c643135b
+ size 15920
15_128_e5_3e-5/checkpoint-1119/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a129e02303147cc2d7f5213409bed45a6f0216dbabe699826203017f31f1d718
+ size 1064
15_128_e5_3e-5/checkpoint-1119/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
+ {
+ "additional_special_tokens": [
+ "<|endoftext|>",
+ "<fim_prefix>",
+ "<fim_middle>",
+ "<fim_suffix>",
+ "<fim_pad>",
+ "<filename>",
+ "<gh_stars>",
+ "<issue_start>",
+ "<issue_comment>",
+ "<issue_closed>",
+ "<jupyter_start>",
+ "<jupyter_text>",
+ "<jupyter_code>",
+ "<jupyter_output>",
+ "<empty_output>",
+ "<commit_before>",
+ "<commit_msg>",
+ "<commit_after>",
+ "<reponame>"
+ ],
+ "bos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<reponame>",
+ "unk_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
15_128_e5_3e-5/checkpoint-1119/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
15_128_e5_3e-5/checkpoint-1119/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "0": { "content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "1": { "content": "<fim_prefix>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "2": { "content": "<fim_middle>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "3": { "content": "<fim_suffix>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "4": { "content": "<fim_pad>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "5": { "content": "<filename>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "6": { "content": "<gh_stars>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "7": { "content": "<issue_start>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "8": { "content": "<issue_comment>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "9": { "content": "<issue_closed>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "10": { "content": "<jupyter_start>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "11": { "content": "<jupyter_text>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "12": { "content": "<jupyter_code>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "13": { "content": "<jupyter_output>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "14": { "content": "<empty_output>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "15": { "content": "<commit_before>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "16": { "content": "<commit_msg>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "17": { "content": "<commit_after>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
+ "18": { "content": "<reponame>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true }
+ },
+ "additional_special_tokens": [
+ "<|endoftext|>",
+ "<fim_prefix>",
+ "<fim_middle>",
+ "<fim_suffix>",
+ "<fim_pad>",
+ "<filename>",
+ "<gh_stars>",
+ "<issue_start>",
+ "<issue_comment>",
+ "<issue_closed>",
+ "<jupyter_start>",
+ "<jupyter_text>",
+ "<jupyter_code>",
+ "<jupyter_output>",
+ "<empty_output>",
+ "<commit_before>",
+ "<commit_msg>",
+ "<commit_after>",
+ "<reponame>"
+ ],
+ "bos_token": "<|endoftext|>",
+ "clean_up_tokenization_spaces": true,
+ "eos_token": "<|endoftext|>",
+ "extra_special_tokens": {},
+ "model_max_length": 8192,
+ "pad_token": "<reponame>",
+ "padding_side": "left",
+ "tokenizer_class": "GPT2Tokenizer",
+ "unk_token": "<|endoftext|>",
+ "vocab_size": 49152
+ }
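Two details of this tokenizer config are easy to miss: the pad token appears to be repurposed from the `<reponame>` special token (this GPT2-style vocabulary has no dedicated pad token), and padding is applied on the left, as is usual for batched causal-LM generation. A quick sanity check, with a placeholder checkpoint path:

```python
# Sketch: load the saved tokenizer and confirm the padding setup recorded in
# tokenizer_config.json. The checkpoint path is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("15_128_e5_3e-5/checkpoint-1119")
assert tok.pad_token == "<reponame>"
assert tok.padding_side == "left"
batch = tok(["def f():", "x = 1"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # the shorter prompt is left-padded with <reponame>
```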
15_128_e5_3e-5/checkpoint-1119/trainer_state.json ADDED
@@ -0,0 +1,1595 @@
+ {
+ "best_global_step": null,
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 3.0,
+ "eval_steps": 500,
+ "global_step": 1119,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ { "epoch": 0.013422818791946308, "grad_norm": 1.2052627801895142, "learning_rate": 1.276595744680851e-06, "loss": 1.3229, "step": 5 },
+ { "epoch": 0.026845637583892617, "grad_norm": 1.3943337202072144, "learning_rate": 2.872340425531915e-06, "loss": 1.3715, "step": 10 },
+ { "epoch": 0.040268456375838924, "grad_norm": 0.7524759769439697, "learning_rate": 4.468085106382979e-06, "loss": 1.3464, "step": 15 },
+ { "epoch": 0.053691275167785234, "grad_norm": 0.5896478295326233, "learning_rate": 6.063829787234042e-06, "loss": 1.2983, "step": 20 },
+ { "epoch": 0.06711409395973154, "grad_norm": 0.6688717007637024, "learning_rate": 7.659574468085105e-06, "loss": 1.324, "step": 25 },
+ { "epoch": 0.08053691275167785, "grad_norm": 0.6408427357673645, "learning_rate": 9.255319148936171e-06, "loss": 1.2316, "step": 30 },
+ { "epoch": 0.09395973154362416, "grad_norm": 0.6618767380714417, "learning_rate": 1.0851063829787235e-05, "loss": 1.2444, "step": 35 },
+ { "epoch": 0.10738255033557047, "grad_norm": 0.5582337379455566, "learning_rate": 1.2446808510638298e-05, "loss": 1.2841, "step": 40 },
+ { "epoch": 0.12080536912751678, "grad_norm": 0.5900721549987793, "learning_rate": 1.4042553191489362e-05, "loss": 1.1743, "step": 45 },
+ { "epoch": 0.1342281879194631, "grad_norm": 0.5596632957458496, "learning_rate": 1.5638297872340426e-05, "loss": 1.2086, "step": 50 },
+ { "epoch": 0.1476510067114094, "grad_norm": 0.4909383952617645, "learning_rate": 1.723404255319149e-05, "loss": 1.2367, "step": 55 },
+ { "epoch": 0.1610738255033557, "grad_norm": 0.5068446397781372, "learning_rate": 1.8829787234042554e-05, "loss": 1.2154, "step": 60 },
+ { "epoch": 0.174496644295302, "grad_norm": 0.5289482474327087, "learning_rate": 2.0425531914893616e-05, "loss": 1.2072, "step": 65 },
+ { "epoch": 0.18791946308724833, "grad_norm": 0.5647903680801392, "learning_rate": 2.2021276595744682e-05, "loss": 1.1745, "step": 70 },
+ { "epoch": 0.20134228187919462, "grad_norm": 0.7343299388885498, "learning_rate": 2.3617021276595744e-05, "loss": 1.1381, "step": 75 },
+ { "epoch": 0.21476510067114093, "grad_norm": 0.49988463521003723, "learning_rate": 2.521276595744681e-05, "loss": 1.2065, "step": 80 },
+ { "epoch": 0.22818791946308725, "grad_norm": 0.5250614285469055, "learning_rate": 2.6808510638297873e-05, "loss": 1.1187, "step": 85 },
+ { "epoch": 0.24161073825503357, "grad_norm": 0.5745874643325806, "learning_rate": 2.8404255319148935e-05, "loss": 1.1685, "step": 90 },
+ { "epoch": 0.2550335570469799, "grad_norm": 0.661177396774292, "learning_rate": 3e-05, "loss": 1.1199, "step": 95 },
+ { "epoch": 0.2684563758389262, "grad_norm": 0.6243076920509338, "learning_rate": 2.999940998772382e-05, "loss": 1.1587, "step": 100 },
+ { "epoch": 0.28187919463087246, "grad_norm": 0.5835176706314087, "learning_rate": 2.9997639997310543e-05, "loss": 1.12, "step": 105 },
+ { "epoch": 0.2953020134228188, "grad_norm": 0.573341965675354, "learning_rate": 2.9994690168002316e-05, "loss": 1.1094, "step": 110 },
+ { "epoch": 0.3087248322147651, "grad_norm": 0.655681848526001, "learning_rate": 2.9990560731857203e-05, "loss": 1.0707, "step": 115 },
+ { "epoch": 0.3221476510067114, "grad_norm": 0.7921518087387085, "learning_rate": 2.9985252013730937e-05, "loss": 1.0758, "step": 120 },
+ { "epoch": 0.33557046979865773, "grad_norm": 0.7320600748062134, "learning_rate": 2.9978764431251368e-05, "loss": 1.0883, "step": 125 },
+ { "epoch": 0.348993288590604, "grad_norm": 0.6482290625572205, "learning_rate": 2.9971098494785612e-05, "loss": 1.065, "step": 130 },
+ { "epoch": 0.3624161073825503, "grad_norm": 0.6432045698165894, "learning_rate": 2.9962254807399876e-05, "loss": 1.0591, "step": 135 },
+ { "epoch": 0.37583892617449666, "grad_norm": 0.879544734954834, "learning_rate": 2.9952234064812045e-05, "loss": 1.0341, "step": 140 },
+ { "epoch": 0.38926174496644295, "grad_norm": 0.5897073149681091, "learning_rate": 2.9941037055336938e-05, "loss": 0.9921, "step": 145 },
+ { "epoch": 0.40268456375838924, "grad_norm": 0.6965837478637695, "learning_rate": 2.9928664659824302e-05, "loss": 0.9438, "step": 150 },
+ { "epoch": 0.4161073825503356, "grad_norm": 0.7513095736503601, "learning_rate": 2.991511785158949e-05, "loss": 1.0137, "step": 155 },
+ { "epoch": 0.42953020134228187, "grad_norm": 0.6117614507675171, "learning_rate": 2.990039769633693e-05, "loss": 1.046, "step": 160 },
+ { "epoch": 0.4429530201342282, "grad_norm": 0.7419096827507019, "learning_rate": 2.9884505352076267e-05, "loss": 0.982, "step": 165 },
+ { "epoch": 0.4563758389261745, "grad_norm": 0.7812432646751404, "learning_rate": 2.986744206903125e-05, "loss": 0.9778, "step": 170 },
+ { "epoch": 0.4697986577181208, "grad_norm": 0.759610652923584, "learning_rate": 2.984920918954142e-05, "loss": 0.9469, "step": 175 },
+ { "epoch": 0.48322147651006714, "grad_norm": 0.6890200972557068, "learning_rate": 2.982980814795647e-05, "loss": 0.9265, "step": 180 },
+ { "epoch": 0.4966442953020134, "grad_norm": 0.8574821949005127, "learning_rate": 2.980924047052343e-05, "loss": 1.0086, "step": 185 },
+ { "epoch": 0.5100671140939598, "grad_norm": 0.8379565477371216, "learning_rate": 2.9787507775266585e-05, "loss": 0.9134, "step": 190 },
+ { "epoch": 0.5234899328859061, "grad_norm": 0.7856785655021667, "learning_rate": 2.9764611771860203e-05, "loss": 0.9333, "step": 195 },
+ { "epoch": 0.5369127516778524, "grad_norm": 0.7260093092918396, "learning_rate": 2.974055426149403e-05, "loss": 0.9066, "step": 200 },
+ { "epoch": 0.5503355704697986, "grad_norm": 0.8396151065826416, "learning_rate": 2.9715337136731593e-05, "loss": 0.917, "step": 205 },
+ { "epoch": 0.5637583892617449, "grad_norm": 0.8800502419471741, "learning_rate": 2.9688962381361317e-05, "loss": 0.9126, "step": 210 },
+ { "epoch": 0.5771812080536913, "grad_norm": 0.833662748336792, "learning_rate": 2.966143207024046e-05, "loss": 0.9177, "step": 215 },
+ { "epoch": 0.5906040268456376, "grad_norm": 0.8988651633262634, "learning_rate": 2.9632748369131893e-05, "loss": 0.8965, "step": 220 },
+ { "epoch": 0.6040268456375839, "grad_norm": 0.757098376750946, "learning_rate": 2.9602913534533717e-05, "loss": 0.8824, "step": 225 },
+ { "epoch": 0.6174496644295302, "grad_norm": 0.7813494205474854, "learning_rate": 2.9571929913501764e-05, "loss": 0.8686, "step": 230 },
+ { "epoch": 0.6308724832214765, "grad_norm": 0.9873630404472351, "learning_rate": 2.9539799943464923e-05, "loss": 0.8493, "step": 235 },
+ { "epoch": 0.6442953020134228, "grad_norm": 0.843078076839447, "learning_rate": 2.9506526152033436e-05, "loss": 0.8322, "step": 240 },
+ { "epoch": 0.6577181208053692, "grad_norm": 0.9392512440681458, "learning_rate": 2.947211115680003e-05, "loss": 0.8489, "step": 245 },
+ { "epoch": 0.6711409395973155, "grad_norm": 0.9625651240348816, "learning_rate": 2.943655766513399e-05, "loss": 0.8091, "step": 250 },
+ { "epoch": 0.6845637583892618, "grad_norm": 1.0299538373947144, "learning_rate": 2.939986847396818e-05, "loss": 0.8257, "step": 255 },
+ { "epoch": 0.697986577181208, "grad_norm": 0.9097337126731873, "learning_rate": 2.936204646957904e-05, "loss": 0.8299, "step": 260 },
+ { "epoch": 0.7114093959731543, "grad_norm": 0.9454818964004517, "learning_rate": 2.9323094627359483e-05, "loss": 0.8321, "step": 265 },
+ { "epoch": 0.7248322147651006, "grad_norm": 0.9772825241088867, "learning_rate": 2.928301601158485e-05, "loss": 0.8076, "step": 270 },
+ { "epoch": 0.738255033557047, "grad_norm": 1.0267034769058228, "learning_rate": 2.924181377517186e-05, "loss": 0.8036, "step": 275 },
+ { "epoch": 0.7516778523489933, "grad_norm": 1.621317982673645, "learning_rate": 2.9199491159430543e-05, "loss": 0.7773, "step": 280 },
+ { "epoch": 0.7651006711409396, "grad_norm": 0.8764991760253906, "learning_rate": 2.9156051493809284e-05, "loss": 0.8063, "step": 285 },
+ { "epoch": 0.7785234899328859, "grad_norm": 0.888109564781189, "learning_rate": 2.911149819563288e-05, "loss": 0.8066, "step": 290 },
+ { "epoch": 0.7919463087248322, "grad_norm": 1.1008830070495605, "learning_rate": 2.9065834769833716e-05, "loss": 0.7899, "step": 295 },
+ { "epoch": 0.8053691275167785, "grad_norm": 0.9235763549804688, "learning_rate": 2.9019064808676025e-05, "loss": 0.761, "step": 300 },
+ { "epoch": 0.8187919463087249, "grad_norm": 0.8757153153419495, "learning_rate": 2.8971191991473312e-05, "loss": 0.741, "step": 305 },
+ { "epoch": 0.8322147651006712, "grad_norm": 0.9042085409164429, "learning_rate": 2.892222008429888e-05, "loss": 0.8085, "step": 310 },
+ { "epoch": 0.8456375838926175, "grad_norm": 1.0506023168563843, "learning_rate": 2.8872152939689597e-05, "loss": 0.7472, "step": 315 },
+ { "epoch": 0.8590604026845637, "grad_norm": 0.9140369296073914, "learning_rate": 2.882099449634279e-05, "loss": 0.7408, "step": 320 },
+ { "epoch": 0.87248322147651, "grad_norm": 0.9501097798347473, "learning_rate": 2.876874877880639e-05, "loss": 0.6912, "step": 325 },
+ { "epoch": 0.8859060402684564, "grad_norm": 0.991152822971344, "learning_rate": 2.8715419897162375e-05, "loss": 0.7086, "step": 330 },
+ { "epoch": 0.8993288590604027, "grad_norm": 1.201697826385498, "learning_rate": 2.8661012046703393e-05, "loss": 0.7267, "step": 335 },
+ { "epoch": 0.912751677852349, "grad_norm": 1.0139992237091064, "learning_rate": 2.8605529507602727e-05, "loss": 0.723, "step": 340 },
+ { "epoch": 0.9261744966442953, "grad_norm": 0.9575132131576538, "learning_rate": 2.8548976644577604e-05, "loss": 0.7261, "step": 345 },
+ { "epoch": 0.9395973154362416, "grad_norm": 1.1210860013961792, "learning_rate": 2.849135790654582e-05, "loss": 0.6699, "step": 350 },
+ { "epoch": 0.9530201342281879, "grad_norm": 1.047195315361023, "learning_rate": 2.843267782627574e-05, "loss": 0.6586, "step": 355 },
+ { "epoch": 0.9664429530201343, "grad_norm": 1.0385384559631348, "learning_rate": 2.837294102002973e-05, "loss": 0.732, "step": 360 },
+ { "epoch": 0.9798657718120806, "grad_norm": 1.0537059307098389, "learning_rate": 2.831215218720099e-05, "loss": 0.6734, "step": 365 },
+ { "epoch": 0.9932885906040269, "grad_norm": 0.9848499894142151, "learning_rate": 2.8250316109943874e-05, "loss": 0.6877, "step": 370 },
+ { "epoch": 1.0053691275167784, "grad_norm": 1.0746926069259644, "learning_rate": 2.8187437652797676e-05, "loss": 0.6078, "step": 375 },
+ { "epoch": 1.018791946308725, "grad_norm": 1.1550384759902954, "learning_rate": 2.8123521762303944e-05, "loss": 0.5924, "step": 380 },
+ { "epoch": 1.0322147651006712, "grad_norm": 0.922566831111908, "learning_rate": 2.805857346661734e-05, "loss": 0.6452, "step": 385 },
+ { "epoch": 1.0456375838926175, "grad_norm": 1.3432979583740234, "learning_rate": 2.7992597875110116e-05, "loss": 0.5548, "step": 390 },
+ { "epoch": 1.0590604026845638, "grad_norm": 0.939103901386261, "learning_rate": 2.792560017797012e-05, "loss": 0.5999, "step": 395 },
+ { "epoch": 1.07248322147651, "grad_norm": 1.120595097541809, "learning_rate": 2.785758564579252e-05, "loss": 0.5774, "step": 400 },
+ { "epoch": 1.0859060402684564, "grad_norm": 1.1456084251403809, "learning_rate": 2.778855962916518e-05, "loss": 0.5724, "step": 405 },
+ { "epoch": 1.0993288590604027, "grad_norm": 0.9638969302177429, "learning_rate": 2.7718527558247722e-05, "loss": 0.5771, "step": 410 },
+ { "epoch": 1.112751677852349, "grad_norm": 1.0665216445922852, "learning_rate": 2.7647494942344363e-05, "loss": 0.5886, "step": 415 },
+ { "epoch": 1.1261744966442953, "grad_norm": 0.9236083030700684, "learning_rate": 2.7575467369470473e-05, "loss": 0.5938, "step": 420 },
+ { "epoch": 1.1395973154362415, "grad_norm": 1.0757817029953003, "learning_rate": 2.750245050591303e-05, "loss": 0.5929, "step": 425 },
+ { "epoch": 1.1530201342281878, "grad_norm": 1.2053022384643555, "learning_rate": 2.742845009578481e-05, "loss": 0.5651, "step": 430 },
+ { "epoch": 1.1664429530201343, "grad_norm": 1.1092323064804077, "learning_rate": 2.7353471960572536e-05, "loss": 0.5359, "step": 435 },
+ { "epoch": 1.1798657718120806, "grad_norm": 1.0660345554351807, "learning_rate": 2.7277521998678904e-05, "loss": 0.5321, "step": 440 },
+ { "epoch": 1.193288590604027, "grad_norm": 0.9836210608482361, "learning_rate": 2.7200606184958567e-05, "loss": 0.54, "step": 445 },
+ { "epoch": 1.2067114093959732, "grad_norm": 1.1034318208694458, "learning_rate": 2.7122730570248095e-05, "loss": 0.5081, "step": 450 },
+ { "epoch": 1.2201342281879195, "grad_norm": 1.2415367364883423, "learning_rate": 2.704390128088999e-05, "loss": 0.5504, "step": 455 },
+ { "epoch": 1.2335570469798658, "grad_norm": 1.6209220886230469, "learning_rate": 2.696412451825071e-05, "loss": 0.4975, "step": 460 },
+ { "epoch": 1.246979865771812, "grad_norm": 1.0926618576049805, "learning_rate": 2.6883406558232823e-05, "loss": 0.5557, "step": 465 },
+ { "epoch": 1.2604026845637584, "grad_norm": 1.4722034931182861, "learning_rate": 2.6801753750781313e-05, "loss": 0.5254, "step": 470 },
+ { "epoch": 1.2738255033557047, "grad_norm": 1.122171401977539, "learning_rate": 2.6719172519384015e-05, "loss": 0.5065, "step": 475 },
+ { "epoch": 1.287248322147651, "grad_norm": 1.0232843160629272, "learning_rate": 2.6635669360566298e-05, "loss": 0.5151, "step": 480 },
+ { "epoch": 1.3006711409395972, "grad_norm": 1.203287124633789, "learning_rate": 2.6551250843380007e-05, "loss": 0.5436, "step": 485 },
+ { "epoch": 1.3140939597315437, "grad_norm": 0.9881830811500549, "learning_rate": 2.6465923608886676e-05, "loss": 0.5222, "step": 490 },
+ { "epoch": 1.3275167785234898, "grad_norm": 1.0542739629745483, "learning_rate": 2.6379694369635076e-05, "loss": 0.4936, "step": 495 },
+ { "epoch": 1.3409395973154363, "grad_norm": 1.1496102809906006, "learning_rate": 2.6292569909133176e-05, "loss": 0.5277, "step": 500 },
+ { "epoch": 1.3543624161073826, "grad_norm": 1.401339054107666, "learning_rate": 2.620455708131447e-05, "loss": 0.479, "step": 505 },
+ { "epoch": 1.367785234899329, "grad_norm": 1.1059690713882446, "learning_rate": 2.6115662809998814e-05, "loss": 0.5191, "step": 510 },
+ { "epoch": 1.3812080536912752, "grad_norm": 0.9935198426246643, "learning_rate": 2.6025894088347723e-05, "loss": 0.4817, "step": 515 },
+ { "epoch": 1.3946308724832215, "grad_norm": 1.139981985092163, "learning_rate": 2.5935257978314233e-05, "loss": 0.4679, "step": 520 },
+ { "epoch": 1.4080536912751678, "grad_norm": 1.1750354766845703, "learning_rate": 2.5843761610087354e-05, "loss": 0.4894, "step": 525 },
+ { "epoch": 1.421476510067114, "grad_norm": 1.0493199825286865, "learning_rate": 2.5751412181531153e-05, "loss": 0.4955, "step": 530 },
+ { "epoch": 1.4348993288590604, "grad_norm": 0.992678165435791, "learning_rate": 2.56582169576185e-05, "loss": 0.4937, "step": 535 },
+ { "epoch": 1.4483221476510066, "grad_norm": 1.1107701063156128, "learning_rate": 2.556418326985956e-05, "loss": 0.498, "step": 540 },
+ { "epoch": 1.4617449664429532, "grad_norm": 1.1102780103683472, "learning_rate": 2.5469318515725008e-05, "loss": 0.4869, "step": 545 },
+ { "epoch": 1.4751677852348992, "grad_norm": 1.0612019300460815, "learning_rate": 2.5373630158064123e-05, "loss": 0.4653, "step": 550 },
+ { "epoch": 1.4885906040268457, "grad_norm": 1.0739949941635132, "learning_rate": 2.5277125724517665e-05, "loss": 0.4908, "step": 555 },
+ { "epoch": 1.5020134228187918, "grad_norm": 1.082614779472351, "learning_rate": 2.51798128069257e-05, "loss": 0.4714, "step": 560 },
+ { "epoch": 1.5154362416107383, "grad_norm": 1.1392021179199219, "learning_rate": 2.5081699060730353e-05, "loss": 0.519, "step": 565 },
+ { "epoch": 1.5288590604026846, "grad_norm": 1.22454833984375, "learning_rate": 2.4982792204373603e-05, "loss": 0.4507, "step": 570 },
+ { "epoch": 1.542281879194631, "grad_norm": 1.1701927185058594, "learning_rate": 2.4883100018690028e-05, "loss": 0.4418, "step": 575 },
+ { "epoch": 1.5557046979865772, "grad_norm": 1.0155186653137207, "learning_rate": 2.4782630346294758e-05, "loss": 0.4817, "step": 580 },
+ { "epoch": 1.5691275167785235, "grad_norm": 1.1472469568252563, "learning_rate": 2.4681391090966466e-05, "loss": 0.4083, "step": 585 },
+ { "epoch": 1.5825503355704698, "grad_norm": 1.3812817335128784, "learning_rate": 2.4579390217025616e-05, "loss": 0.4777, "step": 590 },
+ { "epoch": 1.595973154362416, "grad_norm": 1.07793128490448, "learning_rate": 2.447663574870792e-05, "loss": 0.4615, "step": 595 },
+ { "epoch": 1.6093959731543626, "grad_norm": 1.3786406517028809, "learning_rate": 2.4373135769533073e-05, "loss": 0.4206, "step": 600 },
+ { "epoch": 1.6228187919463086, "grad_norm": 1.0537580251693726, "learning_rate": 2.426889842166885e-05, "loss": 0.4562, "step": 605 },
+ { "epoch": 1.6362416107382551, "grad_norm": 1.1155586242675781, "learning_rate": 2.4163931905290565e-05, "loss": 0.4069, "step": 610 },
+ { "epoch": 1.6496644295302012, "grad_norm": 1.0712804794311523, "learning_rate": 2.4058244477935986e-05, "loss": 0.4405, "step": 615 },
+ { "epoch": 1.6630872483221477, "grad_norm": 1.2123838663101196, "learning_rate": 2.3951844453855727e-05, "loss": 0.4167, "step": 620 },
+ { "epoch": 1.676510067114094, "grad_norm": 1.195296049118042, "learning_rate": 2.3844740203359165e-05, "loss": 0.45, "step": 625 },
+ { "epoch": 1.6899328859060403, "grad_norm": 1.0780428647994995, "learning_rate": 2.3736940152155993e-05, "loss": 0.3855, "step": 630 },
+ { "epoch": 1.7033557046979866, "grad_norm": 1.1524149179458618, "learning_rate": 2.362845278069335e-05, "loss": 0.4176, "step": 635 },
+ { "epoch": 1.7167785234899329, "grad_norm": 1.0923837423324585, "learning_rate": 2.35192866234887e-05, "loss": 0.4392, "step": 640 },
+ { "epoch": 1.7302013422818792, "grad_norm": 1.0360054969787598, "learning_rate": 2.340945026845843e-05, "loss": 0.3672, "step": 645 },
+ { "epoch": 1.7436241610738255, "grad_norm": 1.1665080785751343, "learning_rate": 2.3298952356242248e-05, "loss": 0.423, "step": 650 },
+ { "epoch": 1.757046979865772, "grad_norm": 1.076156735420227, "learning_rate": 2.318780157952345e-05, "loss": 0.4082, "step": 655 },
+ { "epoch": 1.770469798657718, "grad_norm": 1.2381459474563599, "learning_rate": 2.3076006682345074e-05, "loss": 0.3961, "step": 660 },
+ { "epoch": 1.7838926174496645, "grad_norm": 1.0834969282150269, "learning_rate": 2.296357645942202e-05, "loss": 0.3938, "step": 665 },
+ { "epoch": 1.7973154362416106, "grad_norm": 1.1974222660064697, "learning_rate": 2.2850519755449183e-05, "loss": 0.3477, "step": 670 },
+ { "epoch": 1.8107382550335571, "grad_norm": 1.2139180898666382, "learning_rate": 2.273684546440566e-05, "loss": 0.4295, "step": 675 },
+ { "epoch": 1.8241610738255034, "grad_norm": 1.1445631980895996, "learning_rate": 2.2622562528855092e-05, "loss": 0.3409, "step": 680 },
+ { "epoch": 1.8375838926174497, "grad_norm": 1.3161190748214722, "learning_rate": 2.2507679939242123e-05, "loss": 0.393, "step": 685 },
+ { "epoch": 1.851006711409396, "grad_norm": 1.1103390455245972, "learning_rate": 2.2392206733185175e-05, "loss": 0.3534, "step": 690 },
+ { "epoch": 1.8644295302013423, "grad_norm": 1.1700199842453003, "learning_rate": 2.2276151994765483e-05, "loss": 0.391, "step": 695 },
+ { "epoch": 1.8778523489932886, "grad_norm": 1.0810993909835815, "learning_rate": 2.2159524853812422e-05, "loss": 0.3608, "step": 700 },
+ { "epoch": 1.8912751677852349, "grad_norm": 1.2553406953811646, "learning_rate": 2.204233448518531e-05, "loss": 0.399, "step": 705 },
+ { "epoch": 1.9046979865771814, "grad_norm": 1.1730430126190186, "learning_rate": 2.1924590108051635e-05, "loss": 0.3848, "step": 710 },
+ { "epoch": 1.9181208053691274, "grad_norm": 1.0870543718338013, "learning_rate": 2.1806300985161786e-05, "loss": 0.3722, "step": 715 },
+ { "epoch": 1.931543624161074, "grad_norm": 1.2579858303070068, "learning_rate": 2.1687476422120397e-05, "loss": 0.376, "step": 720 },
+ { "epoch": 1.94496644295302, "grad_norm": 1.0530152320861816, "learning_rate": 2.1568125766654236e-05, "loss": 0.3564, "step": 725 },
+ { "epoch": 1.9583892617449665, "grad_norm": 1.1041901111602783, "learning_rate": 2.1448258407876902e-05, "loss": 0.3669, "step": 730 },
+ { "epoch": 1.9718120805369126, "grad_norm": 1.2685492038726807, "learning_rate": 2.132788377555016e-05, "loss": 0.3555, "step": 735 },
+ { "epoch": 1.985234899328859, "grad_norm": 1.1222282648086548, "learning_rate": 2.12070113393421e-05, "loss": 0.3263, "step": 740 },
+ { "epoch": 1.9986577181208054, "grad_norm": 1.1764016151428223, "learning_rate": 2.1085650608082222e-05, "loss": 0.3398, "step": 745 },
+ { "epoch": 2.010738255033557, "grad_norm": 1.2533848285675049, "learning_rate": 2.096381112901337e-05, "loss": 0.2537, "step": 750 },
+ { "epoch": 2.0241610738255034, "grad_norm": 0.954615592956543, "learning_rate": 2.084150248704067e-05, "loss": 0.2506, "step": 755 },
+ { "epoch": 2.03758389261745, "grad_norm": 1.0643949508666992, "learning_rate": 2.071873430397747e-05, "loss": 0.2948, "step": 760 },
+ { "epoch": 2.051006711409396, "grad_norm": 1.2162034511566162, "learning_rate": 2.059551623778846e-05, "loss": 0.2649, "step": 765 },
+ { "epoch": 2.0644295302013425, "grad_norm": 1.0362883806228638, "learning_rate": 2.0471857981829878e-05, "loss": 0.2913, "step": 770 },
+ { "epoch": 2.0778523489932885, "grad_norm": 1.1844029426574707, "learning_rate": 2.0347769264086916e-05, "loss": 0.3286, "step": 775 },
+ { "epoch": 2.091275167785235, "grad_norm": 1.0810260772705078, "learning_rate": 2.0223259846408485e-05, "loss": 0.2885, "step": 780 },
+ { "epoch": 2.104697986577181, "grad_norm": 1.1697508096694946, "learning_rate": 2.009833952373925e-05, "loss": 0.2795, "step": 785 },
+ { "epoch": 2.1181208053691276, "grad_norm": 1.029407024383545, "learning_rate": 1.9973018123349067e-05, "loss": 0.2621, "step": 790 },
+ { "epoch": 2.1315436241610737, "grad_norm": 1.0687460899353027, "learning_rate": 1.984730550405989e-05, "loss": 0.2954, "step": 795 },
+ { "epoch": 2.14496644295302, "grad_norm": 1.2070434093475342, "learning_rate": 1.9721211555470197e-05, "loss": 0.2972, "step": 800 },
+ { "epoch": 2.1583892617449663, "grad_norm": 1.248508095741272, "learning_rate": 1.9594746197177025e-05, "loss": 0.2557, "step": 805 },
+ { "epoch": 2.1718120805369128, "grad_norm": 0.9223589897155762, "learning_rate": 1.9467919377995553e-05, "loss": 0.2297, "step": 810 },
+ { "epoch": 2.185234899328859, "grad_norm": 1.2255672216415405, "learning_rate": 1.934074107517647e-05, "loss": 0.2845, "step": 815 },
+ { "epoch": 2.1986577181208053, "grad_norm": 1.2953975200653076, "learning_rate": 1.9213221293621117e-05, "loss": 0.2187, "step": 820 },
+ { "epoch": 2.212080536912752, "grad_norm": 1.223412275314331, "learning_rate": 1.9085370065094367e-05, "loss": 0.2837, "step": 825 },
+ { "epoch": 2.225503355704698, "grad_norm": 1.351789951324463, "learning_rate": 1.8957197447435458e-05, "loss": 0.2387, "step": 830 },
+ { "epoch": 2.2389261744966444, "grad_norm": 1.0444451570510864, "learning_rate": 1.8828713523766784e-05, "loss": 0.2384, "step": 835 },
+ { "epoch": 2.2523489932885905, "grad_norm": 1.1646713018417358, "learning_rate": 1.8699928401700642e-05, "loss": 0.2669, "step": 840 },
+ { "epoch": 2.265771812080537, "grad_norm": 1.4678176641464233, "learning_rate": 1.8570852212544108e-05, "loss": 0.2723, "step": 845 },
+ { "epoch": 2.279194630872483, "grad_norm": 1.0317347049713135, "learning_rate": 1.8441495110501986e-05, "loss": 0.2644, "step": 850 },
+ { "epoch": 2.2926174496644296, "grad_norm": 1.382323980331421, "learning_rate": 1.8311867271878057e-05, "loss": 0.276, "step": 855 },
+ { "epoch": 2.3060402684563757, "grad_norm": 1.1337871551513672, "learning_rate": 1.818197889427446e-05, "loss": 0.2535, "step": 860 },
+ { "epoch": 2.319463087248322, "grad_norm": 1.0053396224975586, "learning_rate": 1.805184019578951e-05, "loss": 0.2476, "step": 865 },
+ { "epoch": 2.3328859060402687, "grad_norm": 1.1298002004623413, "learning_rate": 1.792146141421383e-05, "loss": 0.265, "step": 870 },
+ { "epoch": 2.3463087248322148, "grad_norm": 1.119316577911377, "learning_rate": 1.7790852806224978e-05, "loss": 0.2585, "step": 875 },
+ { "epoch": 2.3597315436241613, "grad_norm": 1.1393946409225464, "learning_rate": 1.7660024646580573e-05, "loss": 0.245, "step": 880 },
+ { "epoch": 2.3731543624161073, "grad_norm": 1.089766502380371, "learning_rate": 1.7528987227309974e-05, "loss": 0.2598, "step": 885 },
+ { "epoch": 2.386577181208054, "grad_norm": 1.2876455783843994, "learning_rate": 1.7397750856904653e-05, "loss": 0.2398, "step": 890 },
+ { "epoch": 2.4, "grad_norm": 1.247550129890442, "learning_rate": 1.7266325859507228e-05, "loss": 0.2306, "step": 895 },
+ { "epoch": 2.4134228187919464, "grad_norm": 1.1357083320617676, "learning_rate": 1.713472257409928e-05, "loss": 0.2652, "step": 900 },
+ { "epoch": 2.4268456375838925, "grad_norm": 1.4018968343734741, "learning_rate": 1.7002951353688e-05, "loss": 0.2237, "step": 905 },
+ { "epoch": 2.440268456375839, "grad_norm": 1.1884123086929321, "learning_rate": 1.6871022564491753e-05, "loss": 0.266, "step": 910 },
+ { "epoch": 2.453691275167785, "grad_norm": 1.113447904586792, "learning_rate": 1.6738946585124565e-05, "loss": 0.2105, "step": 915 },
+ { "epoch": 2.4671140939597316, "grad_norm": 1.12835693359375, "learning_rate": 1.6606733805779663e-05, "loss": 0.2465, "step": 920 },
+ { "epoch": 2.4805369127516776, "grad_norm": 1.2548681497573853, "learning_rate": 1.64743946274121e-05, "loss": 0.213, "step": 925 },
+ { "epoch": 2.493959731543624, "grad_norm": 1.1149048805236816, "learning_rate": 1.6341939460920524e-05, "loss": 0.2353, "step": 930 },
+ { "epoch": 2.5073825503355707, "grad_norm": 1.0424855947494507, "learning_rate": 1.6209378726328168e-05, "loss": 0.2084, "step": 935 },
+ { "epoch": 2.5208053691275167, "grad_norm": 1.1017965078353882, "learning_rate": 1.6076722851963135e-05, "loss": 0.2135, "step": 940 },
+ { "epoch": 2.5342281879194632, "grad_norm": 1.112675666809082, "learning_rate": 1.5943982273638008e-05, "loss": 0.2233, "step": 945 },
+ { "epoch": 2.5476510067114093, "grad_norm": 1.1784453392028809, "learning_rate": 1.581116743382889e-05, "loss": 0.2044, "step": 950 },
+ { "epoch": 2.561073825503356, "grad_norm": 1.257094144821167, "learning_rate": 1.5678288780853903e-05, "loss": 0.2434, "step": 955 },
+ { "epoch": 2.574496644295302, "grad_norm": 1.1470935344696045, "learning_rate": 1.554535676805125e-05, "loss": 0.2308, "step": 960 },
+ { "epoch": 2.5879194630872484, "grad_norm": 1.0138940811157227, "learning_rate": 1.5412381852956858e-05, "loss": 0.2136, "step": 965 },
+ { "epoch": 2.6013422818791945, "grad_norm": 1.0910656452178955, "learning_rate": 1.5279374496481708e-05, "loss": 0.1905, "step": 970 },
+ { "epoch": 2.614765100671141, "grad_norm": 1.3550100326538086, "learning_rate": 1.5146345162088871e-05, "loss": 0.2043, "step": 975 },
+ { "epoch": 2.6281879194630875, "grad_norm": 1.0329605340957642, "learning_rate": 1.5013304314970414e-05, "loss": 0.2153, "step": 980 },
+ { "epoch": 2.6416107382550336, "grad_norm": 1.252626657485962, "learning_rate": 1.4880262421224075e-05, "loss": 0.2115, "step": 985 },
+ { "epoch": 2.6550335570469796, "grad_norm": 1.0826952457427979, "learning_rate": 1.4747229947029918e-05, "loss": 0.2011, "step": 990 },
+ { "epoch": 2.668456375838926, "grad_norm": 1.119947910308838, "learning_rate": 1.4614217357826995e-05, "loss": 0.2054, "step": 995 },
+ { "epoch": 2.6818791946308727, "grad_norm": 1.2884759902954102, "learning_rate": 1.4481235117490053e-05, "loss": 0.1993, "step": 1000 },
+ { "epoch": 2.6953020134228187, "grad_norm": 1.221529483795166, "learning_rate": 1.434829368750633e-05, "loss": 0.2035, "step": 1005 },
+ { "epoch": 2.7087248322147652, "grad_norm": 1.2037923336029053, "learning_rate": 1.4215403526152583e-05, "loss": 0.2078, "step": 1010 },
+ { "epoch": 2.7221476510067113, "grad_norm": 1.3824589252471924, "learning_rate": 1.4082575087672363e-05, "loss": 0.2169, "step": 1015 },
+ { "epoch": 2.735570469798658, "grad_norm": 1.1602627038955688, "learning_rate": 1.3949818821453573e-05, "loss": 0.2305, "step": 1020 },
+ { "epoch": 2.748993288590604, "grad_norm": 1.1708049774169922, "learning_rate": 1.3817145171206455e-05, "loss": 0.2331, "step": 1025 },
+ { "epoch": 2.7624161073825504, "grad_norm": 1.0684705972671509, "learning_rate": 1.3684564574141992e-05, "loss": 0.2126, "step": 1030 },
+ { "epoch": 2.7758389261744965, "grad_norm": 1.5986040830612183, "learning_rate": 1.3552087460150836e-05, "loss": 0.1915, "step": 1035 },
+ { "epoch": 2.789261744966443, "grad_norm": 1.1710542440414429, "learning_rate": 1.3419724250982795e-05, "loss": 0.2027, "step": 1040 },
+ { "epoch": 2.8026845637583895, "grad_norm": 1.1092584133148193, "learning_rate": 1.3287485359426974e-05, "loss": 0.213, "step": 1045 },
+ { "epoch": 2.8161073825503355, "grad_norm": 1.3413935899734497, "learning_rate": 1.3155381188492633e-05, "loss": 0.196, "step": 1050 },
+ { "epoch": 2.8295302013422816, "grad_norm": 1.2237411737442017, "learning_rate": 1.3023422130590778e-05,
1486
+ "loss": 0.1919,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 2.842953020134228,
1491
+ "grad_norm": 1.0657061338424683,
1492
+ "learning_rate": 1.2891618566716627e-05,
1493
+ "loss": 0.2022,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 2.8563758389261746,
1498
+ "grad_norm": 1.2195141315460205,
1499
+ "learning_rate": 1.275998086563294e-05,
1500
+ "loss": 0.2053,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 2.8697986577181207,
1505
+ "grad_norm": 1.0146418809890747,
1506
+ "learning_rate": 1.2628519383054343e-05,
1507
+ "loss": 0.1876,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 2.883221476510067,
1512
+ "grad_norm": 1.0290652513504028,
1513
+ "learning_rate": 1.2497244460832644e-05,
1514
+ "loss": 0.1743,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 2.8966442953020133,
1519
+ "grad_norm": 1.272923231124878,
1520
+ "learning_rate": 1.2366166426143262e-05,
1521
+ "loss": 0.2261,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 2.91006711409396,
1526
+ "grad_norm": 1.1233028173446655,
1527
+ "learning_rate": 1.2235295590672816e-05,
1528
+ "loss": 0.1996,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 2.9234899328859063,
1533
+ "grad_norm": 1.1806379556655884,
1534
+ "learning_rate": 1.2104642249807901e-05,
1535
+ "loss": 0.1859,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 2.9369127516778524,
1540
+ "grad_norm": 1.259519100189209,
1541
+ "learning_rate": 1.1974216681825193e-05,
1542
+ "loss": 0.1911,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 2.9503355704697984,
1547
+ "grad_norm": 1.2275830507278442,
1548
+ "learning_rate": 1.1844029147082853e-05,
1549
+ "loss": 0.2005,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 2.963758389261745,
1554
+ "grad_norm": 1.0096421241760254,
1555
+ "learning_rate": 1.1714089887213382e-05,
1556
+ "loss": 0.2013,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 2.9771812080536915,
1561
+ "grad_norm": 1.4130147695541382,
1562
+ "learning_rate": 1.1584409124317906e-05,
1563
+ "loss": 0.185,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 2.9906040268456375,
1568
+ "grad_norm": 1.0541446208953857,
1569
+ "learning_rate": 1.1454997060162038e-05,
1570
+ "loss": 0.1766,
1571
+ "step": 1115
1572
+ }
1573
+ ],
1574
+ "logging_steps": 5,
1575
+ "max_steps": 1865,
1576
+ "num_input_tokens_seen": 0,
1577
+ "num_train_epochs": 5,
1578
+ "save_steps": 2000,
1579
+ "stateful_callbacks": {
1580
+ "TrainerControl": {
1581
+ "args": {
1582
+ "should_epoch_stop": false,
1583
+ "should_evaluate": false,
1584
+ "should_log": false,
1585
+ "should_save": true,
1586
+ "should_training_stop": false
1587
+ },
1588
+ "attributes": {}
1589
+ }
1590
+ },
1591
+ "total_flos": 1.489123253640233e+18,
1592
+ "train_batch_size": 2,
1593
+ "trial_name": null,
1594
+ "trial_params": null
1595
+ }
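For quick inspection, the `log_history` entries in `trainer_state.json` follow the standard Hugging Face `Trainer` schema (epoch, grad_norm, learning_rate, loss, step, logged every `logging_steps`=5 steps). A minimal sketch for pulling the loss curve out of this file, assuming the repo is cloned locally (the path below is illustrative):

```python
import json

# Load the Trainer state saved with this checkpoint (illustrative local path).
with open("15_128_e5_3e-5/checkpoint-1119/trainer_state.json") as f:
    state = json.load(f)

# Keep only the entries that carry a training loss.
logs = [e for e in state["log_history"] if "loss" in e]
steps = [e["step"] for e in logs]
losses = [e["loss"] for e in logs]
print(f"{len(logs)} log points; final loss {losses[-1]:.4f} at step {steps[-1]}")
```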
15_128_e5_3e-5/checkpoint-1119/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:074596f08a98a2a8988623657ab3d8064774b9b44f8ab9eecb22b494b6b7d866
3
+ size 7736
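The `.bin` and `.pth` entries in this commit are Git LFS pointers (a version line, a sha256 oid, and a byte size) rather than the payloads themselves. As a hedged sketch, after fetching the real file with `git lfs pull`, its hash can be checked against the pointer's oid:

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in 1 MiB chunks so large checkpoints don't fill RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Should print the oid from the pointer above once the real file is present.
print(sha256_of("15_128_e5_3e-5/checkpoint-1119/training_args.bin"))
```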
15_128_e5_3e-5/checkpoint-1119/vocab.json ADDED
The diff for this file is too large to render. See raw diff
15_128_e5_3e-5/checkpoint-1119/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # While this script doesn't use DeepSpeed to recover the data, the checkpoints are pickled with
25
+ # DeepSpeed data structures, so DeepSpeed must be available in the current Python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict
35
+ param_shapes: dict
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict
39
+ frozen_param_fragments: dict
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
62
+ if not os.path.isdir(checkpoint_dir):
63
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
64
+
65
+ # there should be only one file
66
+ if zero_stage <= 2:
67
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
68
+ elif zero_stage == 3:
69
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
70
+
71
+ if not os.path.exists(file):
72
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
73
+
74
+ return file
75
+
76
+
77
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
78
+ # XXX: need to test that this simple glob rule works for multi-node setup too
79
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
80
+
81
+ if len(ckpt_files) == 0:
82
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
83
+
84
+ return ckpt_files
85
+
86
+
87
+ def get_optim_files(checkpoint_dir):
88
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
89
+
90
+
91
+ def get_model_state_files(checkpoint_dir):
92
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
93
+
94
+
95
+ def parse_model_states(files):
96
+ zero_model_states = []
97
+ for file in files:
98
+ state_dict = torch.load(file, map_location=device)
99
+
100
+ if BUFFER_NAMES not in state_dict:
101
+ raise ValueError(f"{file} is not a model state checkpoint")
102
+ buffer_names = state_dict[BUFFER_NAMES]
103
+ if debug:
104
+ print("Found buffers:", buffer_names)
105
+
106
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
107
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
108
+ param_shapes = state_dict[PARAM_SHAPES]
109
+
110
+ # collect parameters that are included in param_shapes
111
+ param_names = []
112
+ for s in param_shapes:
113
+ for name in s.keys():
114
+ param_names.append(name)
115
+
116
+ # update with frozen parameters
117
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
118
+ if frozen_param_shapes is not None:
119
+ if debug:
120
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
121
+ param_names += list(frozen_param_shapes.keys())
122
+
123
+ # handle shared params
124
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
125
+
126
+ ds_version = state_dict.get(DS_VERSION, None)
127
+
128
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
129
+
130
+ z_model_state = zero_model_state(buffers=buffers,
131
+ param_shapes=param_shapes,
132
+ shared_params=shared_params,
133
+ ds_version=ds_version,
134
+ frozen_param_shapes=frozen_param_shapes,
135
+ frozen_param_fragments=frozen_param_fragments)
136
+ zero_model_states.append(z_model_state)
137
+
138
+ return zero_model_states
139
+
140
+
141
+ def parse_optim_states(files, ds_checkpoint_dir):
142
+
143
+ total_files = len(files)
144
+ state_dicts = []
145
+ for f in files:
146
+ state_dict = torch.load(f, map_location=device)
147
+ # immediately discard the two potentially huge optimizer states, as we only care about the fp32 master weights,
148
+ # and also handle the case where it was already removed by another helper script
149
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
150
+ state_dicts.append(state_dict)
151
+
152
+ if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
153
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
154
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
155
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
156
+
157
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
158
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
159
+ # use the max of the partition_count to get the dp world_size.
160
+
161
+ if type(world_size) is list:
162
+ world_size = max(world_size)
163
+
164
+ if world_size != total_files:
165
+ raise ValueError(
166
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
167
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
168
+ )
169
+
170
+ # the groups are named differently in each stage
171
+ if zero_stage <= 2:
172
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
173
+ elif zero_stage == 3:
174
+ fp32_groups_key = FP32_FLAT_GROUPS
175
+ else:
176
+ raise ValueError(f"unknown zero stage {zero_stage}")
177
+
178
+ if zero_stage <= 2:
179
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
180
+ elif zero_stage == 3:
181
+ # if there is more than one param group, there will be multiple flattened tensors - one
182
+ # flattened tensor per group - for simplicity merge them into a single tensor
183
+ #
184
+ # XXX: could make the script more memory efficient for when there are multiple groups - it
185
+ # will require matching the sub-lists of param_shapes for each param group flattened tensor
186
+
187
+ fp32_flat_groups = [
188
+ torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
189
+ ]
190
+
191
+ return zero_stage, world_size, fp32_flat_groups
192
+
193
+
194
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
195
+ """
196
+ Returns fp32 state_dict reconstructed from ds checkpoint
197
+
198
+ Args:
199
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
200
+
201
+ """
202
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
203
+
204
+ optim_files = get_optim_files(ds_checkpoint_dir)
205
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
206
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
207
+
208
+ model_files = get_model_state_files(ds_checkpoint_dir)
209
+
210
+ zero_model_states = parse_model_states(model_files)
211
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
212
+
213
+ if zero_stage <= 2:
214
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
215
+ exclude_frozen_parameters)
216
+ elif zero_stage == 3:
217
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
218
+ exclude_frozen_parameters)
219
+
220
+
221
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
222
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
223
+ return
224
+
225
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
226
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
227
+
228
+ if debug:
229
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
230
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
231
+
232
+ wanted_params = len(frozen_param_shapes)
233
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
234
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
235
+ print(f'Frozen params: Have {avail_numel} numels to process.')
236
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
237
+
238
+ total_params = 0
239
+ total_numel = 0
240
+ for name, shape in frozen_param_shapes.items():
241
+ total_params += 1
242
+ unpartitioned_numel = shape.numel()
243
+ total_numel += unpartitioned_numel
244
+
245
+ state_dict[name] = frozen_param_fragments[name]
246
+
247
+ if debug:
248
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
249
+
250
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
251
+
252
+
253
+ def _has_callable(obj, fn):
254
+ attr = getattr(obj, fn, None)
255
+ return callable(attr)
256
+
257
+
258
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
259
+ param_shapes = zero_model_states[0].param_shapes
260
+
261
+ # Reconstruction protocol:
262
+ #
263
+ # XXX: document this
264
+
265
+ if debug:
266
+ for i in range(world_size):
267
+ for j in range(len(fp32_flat_groups[0])):
268
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
269
+
270
+ # XXX: memory usage doubles here (zero2)
271
+ num_param_groups = len(fp32_flat_groups[0])
272
+ merged_single_partition_of_fp32_groups = []
273
+ for i in range(num_param_groups):
274
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
275
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
276
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
277
+ avail_numel = sum(
278
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
279
+
280
+ if debug:
281
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
282
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
283
+ # not asserting if there is a mismatch due to possible padding
284
+ print(f"Have {avail_numel} numels to process.")
285
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
286
+
287
+ # params
288
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
289
+ # out-of-core computing solution
290
+ total_numel = 0
291
+ total_params = 0
292
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
293
+ offset = 0
294
+ avail_numel = full_single_fp32_vector.numel()
295
+ for name, shape in shapes.items():
296
+
297
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
298
+ total_numel += unpartitioned_numel
299
+ total_params += 1
300
+
301
+ if debug:
302
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
303
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
304
+ offset += unpartitioned_numel
305
+
306
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
307
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
308
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
309
+ # live optimizer object, so we are checking that the numbers are within the right range
310
+ align_to = 2 * world_size
311
+
312
+ def zero2_align(x):
313
+ return align_to * math.ceil(x / align_to)
314
+
315
+ if debug:
316
+ print(f"original offset={offset}, avail_numel={avail_numel}")
317
+
318
+ offset = zero2_align(offset)
319
+ avail_numel = zero2_align(avail_numel)
320
+
321
+ if debug:
322
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
323
+
324
+ # Sanity check
325
+ if offset != avail_numel:
326
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
327
+
328
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
329
+
330
+
331
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
332
+ exclude_frozen_parameters):
333
+ state_dict = OrderedDict()
334
+
335
+ # buffers
336
+ buffers = zero_model_states[0].buffers
337
+ state_dict.update(buffers)
338
+ if debug:
339
+ print(f"added {len(buffers)} buffers")
340
+
341
+ if not exclude_frozen_parameters:
342
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
343
+
344
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
345
+
346
+ # recover shared parameters
347
+ for pair in zero_model_states[0].shared_params:
348
+ if pair[1] in state_dict:
349
+ state_dict[pair[0]] = state_dict[pair[1]]
350
+
351
+ return state_dict
352
+
353
+
354
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
355
+ remainder = unpartitioned_numel % world_size
356
+ padding_numel = (world_size - remainder) if remainder else 0
357
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
358
+ return partitioned_numel, padding_numel
359
+
360
+
361
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
362
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
363
+ return
364
+
365
+ if debug:
366
+ for i in range(world_size):
367
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
368
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
369
+
370
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
371
+ wanted_params = len(frozen_param_shapes)
372
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
373
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
374
+ print(f'Frozen params: Have {avail_numel} numels to process.')
375
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
376
+
377
+ total_params = 0
378
+ total_numel = 0
379
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
380
+ total_params += 1
381
+ unpartitioned_numel = shape.numel()
382
+ total_numel += unpartitioned_numel
383
+
384
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
385
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
386
+
387
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
388
+
389
+ if debug:
390
+ print(
391
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
392
+ )
393
+
394
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
395
+
396
+
397
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
398
+ param_shapes = zero_model_states[0].param_shapes
399
+ avail_numel = fp32_flat_groups[0].numel() * world_size
400
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
401
+ # param, re-consolidating each param, while dealing with padding if any
402
+
403
+ # merge list of dicts, preserving order
404
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
405
+
406
+ if debug:
407
+ for i in range(world_size):
408
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
409
+
410
+ wanted_params = len(param_shapes)
411
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
412
+ # not asserting if there is a mismatch due to possible padding
413
+ avail_numel = fp32_flat_groups[0].numel() * world_size
414
+ print(f"Trainable params: Have {avail_numel} numels to process.")
415
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
416
+
417
+ # params
418
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
419
+ # out-of-core computing solution
420
+ offset = 0
421
+ total_numel = 0
422
+ total_params = 0
423
+ for name, shape in param_shapes.items():
424
+
425
+ unpartitioned_numel = shape.numel()
426
+ total_numel += unpartitioned_numel
427
+ total_params += 1
428
+
429
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
430
+
431
+ if debug:
432
+ print(
433
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
434
+ )
435
+
436
+ # XXX: memory usage doubles here
437
+ state_dict[name] = torch.cat(
438
+ tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
439
+ 0).narrow(0, 0, unpartitioned_numel).view(shape)
440
+ offset += partitioned_numel
441
+
442
+ offset *= world_size
443
+
444
+ # Sanity check
445
+ if offset != avail_numel:
446
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
447
+
448
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
449
+
450
+
451
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
452
+ exclude_frozen_parameters):
453
+ state_dict = OrderedDict()
454
+
455
+ # buffers
456
+ buffers = zero_model_states[0].buffers
457
+ state_dict.update(buffers)
458
+ if debug:
459
+ print(f"added {len(buffers)} buffers")
460
+
461
+ if not exclude_frozen_parameters:
462
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
463
+
464
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
465
+
466
+ # recover shared parameters
467
+ for pair in zero_model_states[0].shared_params:
468
+ if pair[1] in state_dict:
469
+ state_dict[pair[0]] = state_dict[pair[1]]
470
+
471
+ return state_dict
472
+
473
+
474
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
475
+ """
476
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
477
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
478
+ via a model hub.
479
+
480
+ Args:
481
+ - ``checkpoint_dir``: path to the desired checkpoint folder
482
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
483
+ - ``exclude_frozen_parameters``: exclude frozen parameters
484
+
485
+ Returns:
486
+ - pytorch ``state_dict``
487
+
488
+ Note: this approach may not work if your application doesn't have sufficient free CPU memory and
489
+ you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
490
+ the checkpoint.
491
+
492
+ A typical usage might be ::
493
+
494
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
495
+ # do the training and checkpoint saving
496
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
497
+ model = model.cpu() # move to cpu
498
+ model.load_state_dict(state_dict)
499
+ # submit to model hub or save the model to share with others
500
+
501
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
502
+ application, i.e. you will need to re-initialize the DeepSpeed engine, since
503
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
504
+
505
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
506
+
507
+ """
508
+ if tag is None:
509
+ latest_path = os.path.join(checkpoint_dir, 'latest')
510
+ if os.path.isfile(latest_path):
511
+ with open(latest_path, 'r') as fd:
512
+ tag = fd.read().strip()
513
+ else:
514
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
515
+
516
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
517
+
518
+ if not os.path.isdir(ds_checkpoint_dir):
519
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
520
+
521
+ return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
522
+
523
+
524
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
525
+ """
526
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
527
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
528
+
529
+ Args:
530
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
531
+ - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
532
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
533
+ - ``exclude_frozen_parameters``: exclude frozen parameters
534
+ """
535
+
536
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
537
+ print(f"Saving fp32 state dict to {output_file}")
538
+ torch.save(state_dict, output_file)
539
+
540
+
541
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
542
+ """
543
+ 1. Put the provided model to cpu
544
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
545
+ 3. Load it into the provided model
546
+
547
+ Args:
548
+ - ``model``: the model object to update
549
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
550
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
551
+
552
+ Returns:
553
+ - ``model``: the modified model
554
+
555
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
556
+ have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
557
+ conveniently placed for you in the checkpoint folder.
558
+
559
+ A typical usage might be ::
560
+
561
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
562
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
563
+ # submit to model hub or save the model to share with others
564
+
565
+ Note that once this has run, the ``model`` will no longer be usable in the DeepSpeed context
566
+ of the same application. i.e. you will need to re-initialize the deepspeed engine, since
567
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
568
+
569
+ """
570
+ logger.info(f"Extracting fp32 weights")
571
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
572
+
573
+ logger.info(f"Overwriting model with fp32 weights")
574
+ model = model.cpu()
575
+ model.load_state_dict(state_dict, strict=False)
576
+
577
+ return model
578
+
579
+
580
+ if __name__ == "__main__":
581
+
582
+ parser = argparse.ArgumentParser()
583
+ parser.add_argument("checkpoint_dir",
584
+ type=str,
585
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
586
+ parser.add_argument(
587
+ "output_file",
588
+ type=str,
589
+ help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
590
+ parser.add_argument("-t",
591
+ "--tag",
592
+ type=str,
593
+ default=None,
594
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
595
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
596
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
597
+ args = parser.parse_args()
598
+
599
+ debug = args.debug
600
+
601
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
602
+ args.output_file,
603
+ tag=args.tag,
604
+ exclude_frozen_parameters=args.exclude_frozen_parameters)
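Besides the CLI invocation documented in the script header (`python zero_to_fp32.py . pytorch_model.bin`), the same conversion can be driven from Python. A minimal sketch, assuming it is run from inside one of the checkpoint directories above so that the script is importable; paths are illustrative:

```python
# Run from inside a checkpoint dir (e.g. 15_128_e5_3e-5/checkpoint-1492), where
# the `latest` file names the shard folder (here: global_step1492).
# DeepSpeed must be installed for the pickled structures to load.
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    ".",                       # dir containing `latest` and the global_step* folder
    "pytorch_model_fp32.bin",  # consolidated fp32 output file
)
```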
15_128_e5_3e-5/checkpoint-1492/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: ibm-granite/granite-3.3-8b-base
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.15.2
15_128_e5_3e-5/checkpoint-1492/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 256,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 128,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "k_proj",
28
+ "gate_proj",
29
+ "down_proj",
30
+ "up_proj",
31
+ "q_proj",
32
+ "v_proj",
33
+ "o_proj"
34
+ ],
35
+ "task_type": "CAUSAL_LM",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
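The config above describes a LoRA adapter (r=128, alpha=256, dropout 0.05) over the attention and MLP projections of `ibm-granite/granite-3.3-8b-base`. A hedged sketch of attaching it with PEFT; the local checkpoint path is an assumption about how this repo is laid out on disk, and loading the 8B base in bf16 needs substantial memory:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, then wrap it with the saved LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-base", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "15_128_e5_3e-5/checkpoint-1492")
model.eval()

# Optionally fold the adapter into the base weights for adapter-free inference.
merged = model.merge_and_unload()
```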
15_128_e5_3e-5/checkpoint-1492/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c3e6413f22ba940e8faf7656221e3b2e54405ced390ac960c4ec28776be9f83
3
+ size 791751704
15_128_e5_3e-5/checkpoint-1492/latest ADDED
@@ -0,0 +1 @@
1
+ global_step1492
15_128_e5_3e-5/checkpoint-1492/merges.txt ADDED
The diff for this file is too large to render. See raw diff
15_128_e5_3e-5/checkpoint-1492/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:07ea94d61d7e3a5e23c2b9bf56ddea84d1c23164e546a6b7e3710d742ac85017
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4ddcd047c1adfb1aae2f4340200fb28bf072778f1076f5ff0e0f4165758d18fd
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b32d4f9c6f58b4c2f3770ab41e549bba0bdd8ccc7ff9a3e8dd5fea224ddad631
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bcc88d0ad9c0322b5ae6b57c879cce3e9a5c3b6fccf16d3561adeeb13d7cc198
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a2078de328fab8a8c2baba708c5f283e92ad91ede2e19a31a504383c8fdbf9a
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6469036680da78e530c11b63b353d06897943b2dbd191e6638d3971762a82274
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb39f967c19187cc98cf4fd637dd8d422e43273ec389dda0dd6b1e00593d9933
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d88b1007b3f688f1a64f886a40778280109fb44d8a34fb6f421e1bb927806ac6
3
+ size 15920
15_128_e5_3e-5/checkpoint-1492/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:58b3b64bedd7a90ca3a522fd0535610ca0ae56ce89d86127309697854f164438
3
+ size 1064
15_128_e5_3e-5/checkpoint-1492/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<fim_prefix>",
5
+ "<fim_middle>",
6
+ "<fim_suffix>",
7
+ "<fim_pad>",
8
+ "<filename>",
9
+ "<gh_stars>",
10
+ "<issue_start>",
11
+ "<issue_comment>",
12
+ "<issue_closed>",
13
+ "<jupyter_start>",
14
+ "<jupyter_text>",
15
+ "<jupyter_code>",
16
+ "<jupyter_output>",
17
+ "<empty_output>",
18
+ "<commit_before>",
19
+ "<commit_msg>",
20
+ "<commit_after>",
21
+ "<reponame>"
22
+ ],
23
+ "bos_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "eos_token": {
31
+ "content": "<|endoftext|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "pad_token": "<reponame>",
38
+ "unk_token": {
39
+ "content": "<|endoftext|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ }
45
+ }
15_128_e5_3e-5/checkpoint-1492/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
15_128_e5_3e-5/checkpoint-1492/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<fim_prefix>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<fim_middle>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<fim_suffix>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<fim_pad>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<filename>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<gh_stars>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<issue_start>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_comment>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_closed>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<jupyter_start>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_text>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_code>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_output>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<empty_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<commit_before>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<commit_msg>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "17": {
141
+ "content": "<commit_after>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "18": {
149
+ "content": "<reponame>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ }
156
+ },
157
+ "additional_special_tokens": [
158
+ "<|endoftext|>",
159
+ "<fim_prefix>",
160
+ "<fim_middle>",
161
+ "<fim_suffix>",
162
+ "<fim_pad>",
163
+ "<filename>",
164
+ "<gh_stars>",
165
+ "<issue_start>",
166
+ "<issue_comment>",
167
+ "<issue_closed>",
168
+ "<jupyter_start>",
169
+ "<jupyter_text>",
170
+ "<jupyter_code>",
171
+ "<jupyter_output>",
172
+ "<empty_output>",
173
+ "<commit_before>",
174
+ "<commit_msg>",
175
+ "<commit_after>",
176
+ "<reponame>"
177
+ ],
178
+ "bos_token": "<|endoftext|>",
179
+ "clean_up_tokenization_spaces": true,
180
+ "eos_token": "<|endoftext|>",
181
+ "extra_special_tokens": {},
182
+ "model_max_length": 8192,
183
+ "pad_token": "<reponame>",
184
+ "padding_side": "left",
185
+ "tokenizer_class": "GPT2Tokenizer",
186
+ "unk_token": "<|endoftext|>",
187
+ "vocab_size": 49152
188
+ }
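Two details of this tokenizer config are worth noting: `pad_token` is repurposed from the otherwise-unused `<reponame>` special token, and `padding_side` is `left`, the usual setup for batched causal-LM generation. A minimal sketch of loading it (local checkpoint path assumed; `transformers` and `torch` installed):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("15_128_e5_3e-5/checkpoint-1492")

# Left padding aligns all prompts at the right edge, so generation continues
# from real tokens rather than from pad tokens.
batch = tok(["def add(a, b):", "print("], padding=True, return_tensors="pt")
print(tok.pad_token, tok.padding_side, batch["input_ids"].shape)
```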
15_128_e5_3e-5/checkpoint-1492/trainer_state.json ADDED
@@ -0,0 +1,2120 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 4.0,
6
+ "eval_steps": 500,
7
+ "global_step": 1492,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.013422818791946308,
14
+ "grad_norm": 1.2052627801895142,
15
+ "learning_rate": 1.276595744680851e-06,
16
+ "loss": 1.3229,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.026845637583892617,
21
+ "grad_norm": 1.3943337202072144,
22
+ "learning_rate": 2.872340425531915e-06,
23
+ "loss": 1.3715,
24
+ "step": 10
25
+ },
26
+ {
27
+ "epoch": 0.040268456375838924,
28
+ "grad_norm": 0.7524759769439697,
29
+ "learning_rate": 4.468085106382979e-06,
30
+ "loss": 1.3464,
31
+ "step": 15
32
+ },
33
+ {
34
+ "epoch": 0.053691275167785234,
35
+ "grad_norm": 0.5896478295326233,
36
+ "learning_rate": 6.063829787234042e-06,
37
+ "loss": 1.2983,
38
+ "step": 20
39
+ },
40
+ {
41
+ "epoch": 0.06711409395973154,
42
+ "grad_norm": 0.6688717007637024,
43
+ "learning_rate": 7.659574468085105e-06,
44
+ "loss": 1.324,
45
+ "step": 25
46
+ },
47
+ {
48
+ "epoch": 0.08053691275167785,
49
+ "grad_norm": 0.6408427357673645,
50
+ "learning_rate": 9.255319148936171e-06,
51
+ "loss": 1.2316,
52
+ "step": 30
53
+ },
54
+ {
55
+ "epoch": 0.09395973154362416,
56
+ "grad_norm": 0.6618767380714417,
57
+ "learning_rate": 1.0851063829787235e-05,
58
+ "loss": 1.2444,
59
+ "step": 35
60
+ },
61
+ {
62
+ "epoch": 0.10738255033557047,
63
+ "grad_norm": 0.5582337379455566,
64
+ "learning_rate": 1.2446808510638298e-05,
65
+ "loss": 1.2841,
66
+ "step": 40
67
+ },
68
+ {
69
+ "epoch": 0.12080536912751678,
70
+ "grad_norm": 0.5900721549987793,
71
+ "learning_rate": 1.4042553191489362e-05,
72
+ "loss": 1.1743,
73
+ "step": 45
74
+ },
75
+ {
76
+ "epoch": 0.1342281879194631,
77
+ "grad_norm": 0.5596632957458496,
78
+ "learning_rate": 1.5638297872340426e-05,
79
+ "loss": 1.2086,
80
+ "step": 50
81
+ },
82
+ {
83
+ "epoch": 0.1476510067114094,
84
+ "grad_norm": 0.4909383952617645,
85
+ "learning_rate": 1.723404255319149e-05,
86
+ "loss": 1.2367,
87
+ "step": 55
88
+ },
89
+ {
90
+ "epoch": 0.1610738255033557,
91
+ "grad_norm": 0.5068446397781372,
92
+ "learning_rate": 1.8829787234042554e-05,
93
+ "loss": 1.2154,
94
+ "step": 60
95
+ },
96
+ {
97
+ "epoch": 0.174496644295302,
98
+ "grad_norm": 0.5289482474327087,
99
+ "learning_rate": 2.0425531914893616e-05,
100
+ "loss": 1.2072,
101
+ "step": 65
102
+ },
103
+ {
104
+ "epoch": 0.18791946308724833,
105
+ "grad_norm": 0.5647903680801392,
106
+ "learning_rate": 2.2021276595744682e-05,
107
+ "loss": 1.1745,
108
+ "step": 70
109
+ },
110
+ {
111
+ "epoch": 0.20134228187919462,
112
+ "grad_norm": 0.7343299388885498,
113
+ "learning_rate": 2.3617021276595744e-05,
114
+ "loss": 1.1381,
115
+ "step": 75
116
+ },
117
+ {
118
+ "epoch": 0.21476510067114093,
119
+ "grad_norm": 0.49988463521003723,
120
+ "learning_rate": 2.521276595744681e-05,
121
+ "loss": 1.2065,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.22818791946308725,
126
+ "grad_norm": 0.5250614285469055,
127
+ "learning_rate": 2.6808510638297873e-05,
128
+ "loss": 1.1187,
129
+ "step": 85
130
+ },
131
+ {
132
+ "epoch": 0.24161073825503357,
133
+ "grad_norm": 0.5745874643325806,
134
+ "learning_rate": 2.8404255319148935e-05,
135
+ "loss": 1.1685,
136
+ "step": 90
137
+ },
138
+ {
139
+ "epoch": 0.2550335570469799,
140
+ "grad_norm": 0.661177396774292,
141
+ "learning_rate": 3e-05,
142
+ "loss": 1.1199,
143
+ "step": 95
144
+ },
145
+ {
146
+ "epoch": 0.2684563758389262,
147
+ "grad_norm": 0.6243076920509338,
148
+ "learning_rate": 2.999940998772382e-05,
149
+ "loss": 1.1587,
150
+ "step": 100
151
+ },
152
+ {
153
+ "epoch": 0.28187919463087246,
154
+ "grad_norm": 0.5835176706314087,
155
+ "learning_rate": 2.9997639997310543e-05,
156
+ "loss": 1.12,
157
+ "step": 105
158
+ },
159
+ {
160
+ "epoch": 0.2953020134228188,
161
+ "grad_norm": 0.573341965675354,
162
+ "learning_rate": 2.9994690168002316e-05,
163
+ "loss": 1.1094,
164
+ "step": 110
165
+ },
166
+ {
167
+ "epoch": 0.3087248322147651,
168
+ "grad_norm": 0.655681848526001,
169
+ "learning_rate": 2.9990560731857203e-05,
170
+ "loss": 1.0707,
171
+ "step": 115
172
+ },
173
+ {
174
+ "epoch": 0.3221476510067114,
175
+ "grad_norm": 0.7921518087387085,
176
+ "learning_rate": 2.9985252013730937e-05,
177
+ "loss": 1.0758,
178
+ "step": 120
179
+ },
180
+ {
181
+ "epoch": 0.33557046979865773,
182
+ "grad_norm": 0.7320600748062134,
183
+ "learning_rate": 2.9978764431251368e-05,
184
+ "loss": 1.0883,
185
+ "step": 125
186
+ },
187
+ {
188
+ "epoch": 0.348993288590604,
189
+ "grad_norm": 0.6482290625572205,
190
+ "learning_rate": 2.9971098494785612e-05,
191
+ "loss": 1.065,
192
+ "step": 130
193
+ },
194
+ {
195
+ "epoch": 0.3624161073825503,
196
+ "grad_norm": 0.6432045698165894,
197
+ "learning_rate": 2.9962254807399876e-05,
198
+ "loss": 1.0591,
199
+ "step": 135
200
+ },
201
+ {
202
+ "epoch": 0.37583892617449666,
203
+ "grad_norm": 0.879544734954834,
204
+ "learning_rate": 2.9952234064812045e-05,
205
+ "loss": 1.0341,
206
+ "step": 140
207
+ },
208
+ {
209
+ "epoch": 0.38926174496644295,
210
+ "grad_norm": 0.5897073149681091,
211
+ "learning_rate": 2.9941037055336938e-05,
212
+ "loss": 0.9921,
213
+ "step": 145
214
+ },
215
+ {
216
+ "epoch": 0.40268456375838924,
217
+ "grad_norm": 0.6965837478637695,
218
+ "learning_rate": 2.9928664659824302e-05,
219
+ "loss": 0.9438,
220
+ "step": 150
221
+ },
222
+ {
223
+ "epoch": 0.4161073825503356,
224
+ "grad_norm": 0.7513095736503601,
225
+ "learning_rate": 2.991511785158949e-05,
226
+ "loss": 1.0137,
227
+ "step": 155
228
+ },
229
+ {
230
+ "epoch": 0.42953020134228187,
231
+ "grad_norm": 0.6117614507675171,
232
+ "learning_rate": 2.990039769633693e-05,
233
+ "loss": 1.046,
234
+ "step": 160
235
+ },
236
+ {
237
+ "epoch": 0.4429530201342282,
238
+ "grad_norm": 0.7419096827507019,
239
+ "learning_rate": 2.9884505352076267e-05,
240
+ "loss": 0.982,
241
+ "step": 165
242
+ },
243
+ {
244
+ "epoch": 0.4563758389261745,
245
+ "grad_norm": 0.7812432646751404,
246
+ "learning_rate": 2.986744206903125e-05,
247
+ "loss": 0.9778,
248
+ "step": 170
249
+ },
250
+ {
251
+ "epoch": 0.4697986577181208,
252
+ "grad_norm": 0.759610652923584,
253
+ "learning_rate": 2.984920918954142e-05,
254
+ "loss": 0.9469,
255
+ "step": 175
256
+ },
257
+ {
258
+ "epoch": 0.48322147651006714,
259
+ "grad_norm": 0.6890200972557068,
260
+ "learning_rate": 2.982980814795647e-05,
261
+ "loss": 0.9265,
262
+ "step": 180
263
+ },
264
+ {
265
+ "epoch": 0.4966442953020134,
266
+ "grad_norm": 0.8574821949005127,
267
+ "learning_rate": 2.980924047052343e-05,
268
+ "loss": 1.0086,
269
+ "step": 185
270
+ },
271
+ {
272
+ "epoch": 0.5100671140939598,
273
+ "grad_norm": 0.8379565477371216,
274
+ "learning_rate": 2.9787507775266585e-05,
275
+ "loss": 0.9134,
276
+ "step": 190
277
+ },
278
+ {
279
+ "epoch": 0.5234899328859061,
280
+ "grad_norm": 0.7856785655021667,
281
+ "learning_rate": 2.9764611771860203e-05,
282
+ "loss": 0.9333,
283
+ "step": 195
284
+ },
285
+ {
286
+ "epoch": 0.5369127516778524,
287
+ "grad_norm": 0.7260093092918396,
288
+ "learning_rate": 2.974055426149403e-05,
289
+ "loss": 0.9066,
290
+ "step": 200
291
+ },
292
+ {
293
+ "epoch": 0.5503355704697986,
294
+ "grad_norm": 0.8396151065826416,
295
+ "learning_rate": 2.9715337136731593e-05,
296
+ "loss": 0.917,
297
+ "step": 205
298
+ },
299
+ {
300
+ "epoch": 0.5637583892617449,
301
+ "grad_norm": 0.8800502419471741,
302
+ "learning_rate": 2.9688962381361317e-05,
303
+ "loss": 0.9126,
304
+ "step": 210
305
+ },
306
+ {
307
+ "epoch": 0.5771812080536913,
308
+ "grad_norm": 0.833662748336792,
309
+ "learning_rate": 2.966143207024046e-05,
310
+ "loss": 0.9177,
311
+ "step": 215
312
+ },
313
+ {
314
+ "epoch": 0.5906040268456376,
315
+ "grad_norm": 0.8988651633262634,
316
+ "learning_rate": 2.9632748369131893e-05,
317
+ "loss": 0.8965,
318
+ "step": 220
319
+ },
320
+ {
321
+ "epoch": 0.6040268456375839,
322
+ "grad_norm": 0.757098376750946,
323
+ "learning_rate": 2.9602913534533717e-05,
324
+ "loss": 0.8824,
325
+ "step": 225
326
+ },
327
+ {
328
+ "epoch": 0.6174496644295302,
329
+ "grad_norm": 0.7813494205474854,
330
+ "learning_rate": 2.9571929913501764e-05,
331
+ "loss": 0.8686,
332
+ "step": 230
333
+ },
334
+ {
335
+ "epoch": 0.6308724832214765,
336
+ "grad_norm": 0.9873630404472351,
337
+ "learning_rate": 2.9539799943464923e-05,
338
+ "loss": 0.8493,
339
+ "step": 235
340
+ },
341
+ {
342
+ "epoch": 0.6442953020134228,
343
+ "grad_norm": 0.843078076839447,
344
+ "learning_rate": 2.9506526152033436e-05,
345
+ "loss": 0.8322,
346
+ "step": 240
347
+ },
348
+ {
349
+ "epoch": 0.6577181208053692,
350
+ "grad_norm": 0.9392512440681458,
351
+ "learning_rate": 2.947211115680003e-05,
352
+ "loss": 0.8489,
353
+ "step": 245
354
+ },
355
+ {
356
+ "epoch": 0.6711409395973155,
357
+ "grad_norm": 0.9625651240348816,
358
+ "learning_rate": 2.943655766513399e-05,
359
+ "loss": 0.8091,
360
+ "step": 250
361
+ },
362
+ {
363
+ "epoch": 0.6845637583892618,
364
+ "grad_norm": 1.0299538373947144,
365
+ "learning_rate": 2.939986847396818e-05,
366
+ "loss": 0.8257,
367
+ "step": 255
368
+ },
369
+ {
370
+ "epoch": 0.697986577181208,
371
+ "grad_norm": 0.9097337126731873,
372
+ "learning_rate": 2.936204646957904e-05,
373
+ "loss": 0.8299,
374
+ "step": 260
375
+ },
376
+ {
377
+ "epoch": 0.7114093959731543,
378
+ "grad_norm": 0.9454818964004517,
379
+ "learning_rate": 2.9323094627359483e-05,
380
+ "loss": 0.8321,
381
+ "step": 265
382
+ },
383
+ {
384
+ "epoch": 0.7248322147651006,
385
+ "grad_norm": 0.9772825241088867,
386
+ "learning_rate": 2.928301601158485e-05,
387
+ "loss": 0.8076,
388
+ "step": 270
389
+ },
390
+ {
391
+ "epoch": 0.738255033557047,
392
+ "grad_norm": 1.0267034769058228,
393
+ "learning_rate": 2.924181377517186e-05,
394
+ "loss": 0.8036,
395
+ "step": 275
396
+ },
397
+ {
398
+ "epoch": 0.7516778523489933,
399
+ "grad_norm": 1.621317982673645,
400
+ "learning_rate": 2.9199491159430543e-05,
401
+ "loss": 0.7773,
402
+ "step": 280
403
+ },
404
+ {
405
+ "epoch": 0.7651006711409396,
406
+ "grad_norm": 0.8764991760253906,
407
+ "learning_rate": 2.9156051493809284e-05,
408
+ "loss": 0.8063,
409
+ "step": 285
410
+ },
411
+ {
412
+ "epoch": 0.7785234899328859,
413
+ "grad_norm": 0.888109564781189,
414
+ "learning_rate": 2.911149819563288e-05,
415
+ "loss": 0.8066,
416
+ "step": 290
417
+ },
418
+ {
419
+ "epoch": 0.7919463087248322,
420
+ "grad_norm": 1.1008830070495605,
421
+ "learning_rate": 2.9065834769833716e-05,
422
+ "loss": 0.7899,
423
+ "step": 295
424
+ },
425
+ {
426
+ "epoch": 0.8053691275167785,
427
+ "grad_norm": 0.9235763549804688,
428
+ "learning_rate": 2.9019064808676025e-05,
429
+ "loss": 0.761,
430
+ "step": 300
431
+ },
432
+ {
433
+ "epoch": 0.8187919463087249,
434
+ "grad_norm": 0.8757153153419495,
435
+ "learning_rate": 2.8971191991473312e-05,
436
+ "loss": 0.741,
437
+ "step": 305
438
+ },
439
+ {
440
+ "epoch": 0.8322147651006712,
441
+ "grad_norm": 0.9042085409164429,
442
+ "learning_rate": 2.892222008429888e-05,
443
+ "loss": 0.8085,
444
+ "step": 310
445
+ },
446
+ {
447
+ "epoch": 0.8456375838926175,
448
+ "grad_norm": 1.0506023168563843,
449
+ "learning_rate": 2.8872152939689597e-05,
450
+ "loss": 0.7472,
451
+ "step": 315
452
+ },
453
+ {
454
+ "epoch": 0.8590604026845637,
455
+ "grad_norm": 0.9140369296073914,
456
+ "learning_rate": 2.882099449634279e-05,
457
+ "loss": 0.7408,
458
+ "step": 320
459
+ },
460
+ {
461
+ "epoch": 0.87248322147651,
462
+ "grad_norm": 0.9501097798347473,
463
+ "learning_rate": 2.876874877880639e-05,
464
+ "loss": 0.6912,
465
+ "step": 325
466
+ },
467
+ {
468
+ "epoch": 0.8859060402684564,
469
+ "grad_norm": 0.991152822971344,
470
+ "learning_rate": 2.8715419897162375e-05,
471
+ "loss": 0.7086,
472
+ "step": 330
473
+ },
474
+ {
475
+ "epoch": 0.8993288590604027,
476
+ "grad_norm": 1.201697826385498,
477
+ "learning_rate": 2.8661012046703393e-05,
478
+ "loss": 0.7267,
479
+ "step": 335
480
+ },
481
+ {
482
+ "epoch": 0.912751677852349,
483
+ "grad_norm": 1.0139992237091064,
484
+ "learning_rate": 2.8605529507602727e-05,
485
+ "loss": 0.723,
486
+ "step": 340
487
+ },
488
+ {
489
+ "epoch": 0.9261744966442953,
490
+ "grad_norm": 0.9575132131576538,
491
+ "learning_rate": 2.8548976644577604e-05,
492
+ "loss": 0.7261,
493
+ "step": 345
494
+ },
495
+ {
496
+ "epoch": 0.9395973154362416,
497
+ "grad_norm": 1.1210860013961792,
498
+ "learning_rate": 2.849135790654582e-05,
499
+ "loss": 0.6699,
500
+ "step": 350
501
+ },
502
+ {
503
+ "epoch": 0.9530201342281879,
504
+ "grad_norm": 1.047195315361023,
505
+ "learning_rate": 2.843267782627574e-05,
506
+ "loss": 0.6586,
507
+ "step": 355
508
+ },
509
+ {
510
+ "epoch": 0.9664429530201343,
511
+ "grad_norm": 1.0385384559631348,
512
+ "learning_rate": 2.837294102002973e-05,
513
+ "loss": 0.732,
514
+ "step": 360
515
+ },
516
+ {
517
+ "epoch": 0.9798657718120806,
518
+ "grad_norm": 1.0537059307098389,
519
+ "learning_rate": 2.831215218720099e-05,
520
+ "loss": 0.6734,
521
+ "step": 365
522
+ },
523
+ {
524
+ "epoch": 0.9932885906040269,
525
+ "grad_norm": 0.9848499894142151,
526
+ "learning_rate": 2.8250316109943874e-05,
527
+ "loss": 0.6877,
528
+ "step": 370
529
+ },
530
+ {
531
+ "epoch": 1.0053691275167784,
532
+ "grad_norm": 1.0746926069259644,
533
+ "learning_rate": 2.8187437652797676e-05,
534
+ "loss": 0.6078,
535
+ "step": 375
536
+ },
537
+ {
538
+ "epoch": 1.018791946308725,
539
+ "grad_norm": 1.1550384759902954,
540
+ "learning_rate": 2.8123521762303944e-05,
541
+ "loss": 0.5924,
542
+ "step": 380
543
+ },
544
+ {
545
+ "epoch": 1.0322147651006712,
546
+ "grad_norm": 0.922566831111908,
547
+ "learning_rate": 2.805857346661734e-05,
548
+ "loss": 0.6452,
549
+ "step": 385
550
+ },
551
+ {
552
+ "epoch": 1.0456375838926175,
553
+ "grad_norm": 1.3432979583740234,
554
+ "learning_rate": 2.7992597875110116e-05,
555
+ "loss": 0.5548,
556
+ "step": 390
557
+ },
558
+ {
559
+ "epoch": 1.0590604026845638,
560
+ "grad_norm": 0.939103901386261,
561
+ "learning_rate": 2.792560017797012e-05,
562
+ "loss": 0.5999,
563
+ "step": 395
564
+ },
565
+ {
566
+ "epoch": 1.07248322147651,
567
+ "grad_norm": 1.120595097541809,
568
+ "learning_rate": 2.785758564579252e-05,
569
+ "loss": 0.5774,
570
+ "step": 400
571
+ },
572
+ {
573
+ "epoch": 1.0859060402684564,
574
+ "grad_norm": 1.1456084251403809,
575
+ "learning_rate": 2.778855962916518e-05,
576
+ "loss": 0.5724,
577
+ "step": 405
578
+ },
579
+ {
580
+ "epoch": 1.0993288590604027,
581
+ "grad_norm": 0.9638969302177429,
582
+ "learning_rate": 2.7718527558247722e-05,
583
+ "loss": 0.5771,
584
+ "step": 410
585
+ },
586
+ {
587
+ "epoch": 1.112751677852349,
588
+ "grad_norm": 1.0665216445922852,
589
+ "learning_rate": 2.7647494942344363e-05,
590
+ "loss": 0.5886,
591
+ "step": 415
592
+ },
593
+ {
594
+ "epoch": 1.1261744966442953,
595
+ "grad_norm": 0.9236083030700684,
596
+ "learning_rate": 2.7575467369470473e-05,
597
+ "loss": 0.5938,
598
+ "step": 420
599
+ },
600
+ {
601
+ "epoch": 1.1395973154362415,
602
+ "grad_norm": 1.0757817029953003,
603
+ "learning_rate": 2.750245050591303e-05,
604
+ "loss": 0.5929,
605
+ "step": 425
606
+ },
607
+ {
608
+ "epoch": 1.1530201342281878,
609
+ "grad_norm": 1.2053022384643555,
610
+ "learning_rate": 2.742845009578481e-05,
611
+ "loss": 0.5651,
612
+ "step": 430
613
+ },
614
+ {
615
+ "epoch": 1.1664429530201343,
616
+ "grad_norm": 1.1092323064804077,
617
+ "learning_rate": 2.7353471960572536e-05,
618
+ "loss": 0.5359,
619
+ "step": 435
620
+ },
621
+ {
622
+ "epoch": 1.1798657718120806,
623
+ "grad_norm": 1.0660345554351807,
624
+ "learning_rate": 2.7277521998678904e-05,
625
+ "loss": 0.5321,
626
+ "step": 440
627
+ },
628
+ {
629
+ "epoch": 1.193288590604027,
630
+ "grad_norm": 0.9836210608482361,
631
+ "learning_rate": 2.7200606184958567e-05,
632
+ "loss": 0.54,
633
+ "step": 445
634
+ },
635
+ {
636
+ "epoch": 1.2067114093959732,
637
+ "grad_norm": 1.1034318208694458,
638
+ "learning_rate": 2.7122730570248095e-05,
639
+ "loss": 0.5081,
640
+ "step": 450
641
+ },
642
+ {
643
+ "epoch": 1.2201342281879195,
644
+ "grad_norm": 1.2415367364883423,
645
+ "learning_rate": 2.704390128088999e-05,
646
+ "loss": 0.5504,
647
+ "step": 455
648
+ },
649
+ {
650
+ "epoch": 1.2335570469798658,
651
+ "grad_norm": 1.6209220886230469,
652
+ "learning_rate": 2.696412451825071e-05,
653
+ "loss": 0.4975,
654
+ "step": 460
655
+ },
656
+ {
657
+ "epoch": 1.246979865771812,
658
+ "grad_norm": 1.0926618576049805,
659
+ "learning_rate": 2.6883406558232823e-05,
660
+ "loss": 0.5557,
661
+ "step": 465
662
+ },
663
+ {
664
+ "epoch": 1.2604026845637584,
665
+ "grad_norm": 1.4722034931182861,
666
+ "learning_rate": 2.6801753750781313e-05,
667
+ "loss": 0.5254,
668
+ "step": 470
669
+ },
670
+ {
671
+ "epoch": 1.2738255033557047,
672
+ "grad_norm": 1.122171401977539,
673
+ "learning_rate": 2.6719172519384015e-05,
674
+ "loss": 0.5065,
675
+ "step": 475
676
+ },
677
+ {
678
+ "epoch": 1.287248322147651,
679
+ "grad_norm": 1.0232843160629272,
680
+ "learning_rate": 2.6635669360566298e-05,
681
+ "loss": 0.5151,
682
+ "step": 480
683
+ },
684
+ {
685
+ "epoch": 1.3006711409395972,
686
+ "grad_norm": 1.203287124633789,
687
+ "learning_rate": 2.6551250843380007e-05,
688
+ "loss": 0.5436,
689
+ "step": 485
690
+ },
691
+ {
692
+ "epoch": 1.3140939597315437,
693
+ "grad_norm": 0.9881830811500549,
694
+ "learning_rate": 2.6465923608886676e-05,
695
+ "loss": 0.5222,
696
+ "step": 490
697
+ },
698
+ {
699
+ "epoch": 1.3275167785234898,
700
+ "grad_norm": 1.0542739629745483,
701
+ "learning_rate": 2.6379694369635076e-05,
702
+ "loss": 0.4936,
703
+ "step": 495
704
+ },
705
+ {
706
+ "epoch": 1.3409395973154363,
707
+ "grad_norm": 1.1496102809906006,
708
+ "learning_rate": 2.6292569909133176e-05,
709
+ "loss": 0.5277,
710
+ "step": 500
711
+ },
712
+ {
713
+ "epoch": 1.3543624161073826,
714
+ "grad_norm": 1.401339054107666,
715
+ "learning_rate": 2.620455708131447e-05,
716
+ "loss": 0.479,
717
+ "step": 505
718
+ },
719
+ {
720
+ "epoch": 1.367785234899329,
721
+ "grad_norm": 1.1059690713882446,
722
+ "learning_rate": 2.6115662809998814e-05,
723
+ "loss": 0.5191,
724
+ "step": 510
725
+ },
726
+ {
727
+ "epoch": 1.3812080536912752,
728
+ "grad_norm": 0.9935198426246643,
729
+ "learning_rate": 2.6025894088347723e-05,
730
+ "loss": 0.4817,
731
+ "step": 515
732
+ },
733
+ {
734
+ "epoch": 1.3946308724832215,
735
+ "grad_norm": 1.139981985092163,
736
+ "learning_rate": 2.5935257978314233e-05,
737
+ "loss": 0.4679,
738
+ "step": 520
739
+ },
740
+ {
741
+ "epoch": 1.4080536912751678,
742
+ "grad_norm": 1.1750354766845703,
743
+ "learning_rate": 2.5843761610087354e-05,
744
+ "loss": 0.4894,
745
+ "step": 525
746
+ },
747
+ {
748
+ "epoch": 1.421476510067114,
749
+ "grad_norm": 1.0493199825286865,
750
+ "learning_rate": 2.5751412181531153e-05,
751
+ "loss": 0.4955,
752
+ "step": 530
753
+ },
754
+ {
755
+ "epoch": 1.4348993288590604,
756
+ "grad_norm": 0.992678165435791,
757
+ "learning_rate": 2.56582169576185e-05,
758
+ "loss": 0.4937,
759
+ "step": 535
760
+ },
761
+ {
762
+ "epoch": 1.4483221476510066,
763
+ "grad_norm": 1.1107701063156128,
764
+ "learning_rate": 2.556418326985956e-05,
765
+ "loss": 0.498,
766
+ "step": 540
767
+ },
768
+ {
769
+ "epoch": 1.4617449664429532,
770
+ "grad_norm": 1.1102780103683472,
771
+ "learning_rate": 2.5469318515725008e-05,
772
+ "loss": 0.4869,
773
+ "step": 545
774
+ },
775
+ {
776
+ "epoch": 1.4751677852348992,
777
+ "grad_norm": 1.0612019300460815,
778
+ "learning_rate": 2.5373630158064123e-05,
779
+ "loss": 0.4653,
780
+ "step": 550
781
+ },
782
+ {
783
+ "epoch": 1.4885906040268457,
784
+ "grad_norm": 1.0739949941635132,
785
+ "learning_rate": 2.5277125724517665e-05,
786
+ "loss": 0.4908,
787
+ "step": 555
788
+ },
789
+ {
790
+ "epoch": 1.5020134228187918,
791
+ "grad_norm": 1.082614779472351,
792
+ "learning_rate": 2.51798128069257e-05,
793
+ "loss": 0.4714,
794
+ "step": 560
795
+ },
796
+ {
797
+ "epoch": 1.5154362416107383,
798
+ "grad_norm": 1.1392021179199219,
799
+ "learning_rate": 2.5081699060730353e-05,
800
+ "loss": 0.519,
801
+ "step": 565
802
+ },
803
+ {
804
+ "epoch": 1.5288590604026846,
805
+ "grad_norm": 1.22454833984375,
806
+ "learning_rate": 2.4982792204373603e-05,
807
+ "loss": 0.4507,
808
+ "step": 570
809
+ },
810
+ {
811
+ "epoch": 1.542281879194631,
812
+ "grad_norm": 1.1701927185058594,
813
+ "learning_rate": 2.4883100018690028e-05,
814
+ "loss": 0.4418,
815
+ "step": 575
816
+ },
817
+ {
818
+ "epoch": 1.5557046979865772,
819
+ "grad_norm": 1.0155186653137207,
820
+ "learning_rate": 2.4782630346294758e-05,
821
+ "loss": 0.4817,
822
+ "step": 580
823
+ },
824
+ {
825
+ "epoch": 1.5691275167785235,
826
+ "grad_norm": 1.1472469568252563,
827
+ "learning_rate": 2.4681391090966466e-05,
828
+ "loss": 0.4083,
829
+ "step": 585
830
+ },
831
+ {
832
+ "epoch": 1.5825503355704698,
833
+ "grad_norm": 1.3812817335128784,
834
+ "learning_rate": 2.4579390217025616e-05,
835
+ "loss": 0.4777,
836
+ "step": 590
837
+ },
838
+ {
839
+ "epoch": 1.595973154362416,
840
+ "grad_norm": 1.07793128490448,
841
+ "learning_rate": 2.447663574870792e-05,
842
+ "loss": 0.4615,
843
+ "step": 595
844
+ },
845
+ {
846
+ "epoch": 1.6093959731543626,
847
+ "grad_norm": 1.3786406517028809,
848
+ "learning_rate": 2.4373135769533073e-05,
849
+ "loss": 0.4206,
850
+ "step": 600
851
+ },
852
+ {
853
+ "epoch": 1.6228187919463086,
854
+ "grad_norm": 1.0537580251693726,
855
+ "learning_rate": 2.426889842166885e-05,
856
+ "loss": 0.4562,
857
+ "step": 605
858
+ },
859
+ {
860
+ "epoch": 1.6362416107382551,
861
+ "grad_norm": 1.1155586242675781,
862
+ "learning_rate": 2.4163931905290565e-05,
863
+ "loss": 0.4069,
864
+ "step": 610
865
+ },
866
+ {
867
+ "epoch": 1.6496644295302012,
868
+ "grad_norm": 1.0712804794311523,
869
+ "learning_rate": 2.4058244477935986e-05,
870
+ "loss": 0.4405,
871
+ "step": 615
872
+ },
873
+ {
874
+ "epoch": 1.6630872483221477,
875
+ "grad_norm": 1.2123838663101196,
876
+ "learning_rate": 2.3951844453855727e-05,
877
+ "loss": 0.4167,
878
+ "step": 620
879
+ },
880
+ {
881
+ "epoch": 1.676510067114094,
882
+ "grad_norm": 1.195296049118042,
883
+ "learning_rate": 2.3844740203359165e-05,
884
+ "loss": 0.45,
885
+ "step": 625
886
+ },
887
+ {
888
+ "epoch": 1.6899328859060403,
889
+ "grad_norm": 1.0780428647994995,
890
+ "learning_rate": 2.3736940152155993e-05,
891
+ "loss": 0.3855,
892
+ "step": 630
893
+ },
894
+ {
895
+ "epoch": 1.7033557046979866,
896
+ "grad_norm": 1.1524149179458618,
897
+ "learning_rate": 2.362845278069335e-05,
898
+ "loss": 0.4176,
899
+ "step": 635
900
+ },
901
+ {
902
+ "epoch": 1.7167785234899329,
903
+ "grad_norm": 1.0923837423324585,
904
+ "learning_rate": 2.35192866234887e-05,
905
+ "loss": 0.4392,
906
+ "step": 640
907
+ },
908
+ {
909
+ "epoch": 1.7302013422818792,
910
+ "grad_norm": 1.0360054969787598,
911
+ "learning_rate": 2.340945026845843e-05,
912
+ "loss": 0.3672,
913
+ "step": 645
914
+ },
915
+ {
916
+ "epoch": 1.7436241610738255,
917
+ "grad_norm": 1.1665080785751343,
918
+ "learning_rate": 2.3298952356242248e-05,
919
+ "loss": 0.423,
920
+ "step": 650
921
+ },
922
+ {
923
+ "epoch": 1.757046979865772,
924
+ "grad_norm": 1.076156735420227,
925
+ "learning_rate": 2.318780157952345e-05,
926
+ "loss": 0.4082,
927
+ "step": 655
928
+ },
929
+ {
930
+ "epoch": 1.770469798657718,
931
+ "grad_norm": 1.2381459474563599,
932
+ "learning_rate": 2.3076006682345074e-05,
933
+ "loss": 0.3961,
934
+ "step": 660
935
+ },
936
+ {
937
+ "epoch": 1.7838926174496645,
938
+ "grad_norm": 1.0834969282150269,
939
+ "learning_rate": 2.296357645942202e-05,
940
+ "loss": 0.3938,
941
+ "step": 665
942
+ },
943
+ {
944
+ "epoch": 1.7973154362416106,
945
+ "grad_norm": 1.1974222660064697,
946
+ "learning_rate": 2.2850519755449183e-05,
947
+ "loss": 0.3477,
948
+ "step": 670
949
+ },
950
+ {
951
+ "epoch": 1.8107382550335571,
952
+ "grad_norm": 1.2139180898666382,
953
+ "learning_rate": 2.273684546440566e-05,
954
+ "loss": 0.4295,
955
+ "step": 675
956
+ },
957
+ {
958
+ "epoch": 1.8241610738255034,
959
+ "grad_norm": 1.1445631980895996,
960
+ "learning_rate": 2.2622562528855092e-05,
961
+ "loss": 0.3409,
962
+ "step": 680
963
+ },
964
+ {
965
+ "epoch": 1.8375838926174497,
966
+ "grad_norm": 1.3161190748214722,
967
+ "learning_rate": 2.2507679939242123e-05,
968
+ "loss": 0.393,
969
+ "step": 685
970
+ },
971
+ {
972
+ "epoch": 1.851006711409396,
973
+ "grad_norm": 1.1103390455245972,
974
+ "learning_rate": 2.2392206733185175e-05,
975
+ "loss": 0.3534,
976
+ "step": 690
977
+ },
978
+ {
979
+ "epoch": 1.8644295302013423,
980
+ "grad_norm": 1.1700199842453003,
981
+ "learning_rate": 2.2276151994765483e-05,
982
+ "loss": 0.391,
983
+ "step": 695
984
+ },
985
+ {
986
+ "epoch": 1.8778523489932886,
987
+ "grad_norm": 1.0810993909835815,
988
+ "learning_rate": 2.2159524853812422e-05,
989
+ "loss": 0.3608,
990
+ "step": 700
991
+ },
992
+ {
993
+ "epoch": 1.8912751677852349,
994
+ "grad_norm": 1.2553406953811646,
995
+ "learning_rate": 2.204233448518531e-05,
996
+ "loss": 0.399,
997
+ "step": 705
998
+ },
999
+ {
1000
+ "epoch": 1.9046979865771814,
1001
+ "grad_norm": 1.1730430126190186,
1002
+ "learning_rate": 2.1924590108051635e-05,
1003
+ "loss": 0.3848,
1004
+ "step": 710
1005
+ },
1006
+ {
1007
+ "epoch": 1.9181208053691274,
1008
+ "grad_norm": 1.0870543718338013,
1009
+ "learning_rate": 2.1806300985161786e-05,
1010
+ "loss": 0.3722,
1011
+ "step": 715
1012
+ },
1013
+ {
1014
+ "epoch": 1.931543624161074,
1015
+ "grad_norm": 1.2579858303070068,
1016
+ "learning_rate": 2.1687476422120397e-05,
1017
+ "loss": 0.376,
1018
+ "step": 720
1019
+ },
1020
+ {
1021
+ "epoch": 1.94496644295302,
1022
+ "grad_norm": 1.0530152320861816,
1023
+ "learning_rate": 2.1568125766654236e-05,
1024
+ "loss": 0.3564,
1025
+ "step": 725
1026
+ },
1027
+ {
1028
+ "epoch": 1.9583892617449665,
1029
+ "grad_norm": 1.1041901111602783,
1030
+ "learning_rate": 2.1448258407876902e-05,
1031
+ "loss": 0.3669,
1032
+ "step": 730
1033
+ },
1034
+ {
1035
+ "epoch": 1.9718120805369126,
1036
+ "grad_norm": 1.2685492038726807,
1037
+ "learning_rate": 2.132788377555016e-05,
1038
+ "loss": 0.3555,
1039
+ "step": 735
1040
+ },
1041
+ {
1042
+ "epoch": 1.985234899328859,
1043
+ "grad_norm": 1.1222282648086548,
1044
+ "learning_rate": 2.12070113393421e-05,
1045
+ "loss": 0.3263,
1046
+ "step": 740
1047
+ },
1048
+ {
1049
+ "epoch": 1.9986577181208054,
1050
+ "grad_norm": 1.1764016151428223,
1051
+ "learning_rate": 2.1085650608082222e-05,
1052
+ "loss": 0.3398,
1053
+ "step": 745
1054
+ },
1055
+ {
1056
+ "epoch": 2.010738255033557,
1057
+ "grad_norm": 1.2533848285675049,
1058
+ "learning_rate": 2.096381112901337e-05,
1059
+ "loss": 0.2537,
1060
+ "step": 750
1061
+ },
1062
+ {
1063
+ "epoch": 2.0241610738255034,
1064
+ "grad_norm": 0.954615592956543,
1065
+ "learning_rate": 2.084150248704067e-05,
1066
+ "loss": 0.2506,
1067
+ "step": 755
1068
+ },
1069
+ {
1070
+ "epoch": 2.03758389261745,
1071
+ "grad_norm": 1.0643949508666992,
1072
+ "learning_rate": 2.071873430397747e-05,
1073
+ "loss": 0.2948,
1074
+ "step": 760
1075
+ },
1076
+ {
1077
+ "epoch": 2.051006711409396,
1078
+ "grad_norm": 1.2162034511566162,
1079
+ "learning_rate": 2.059551623778846e-05,
1080
+ "loss": 0.2649,
1081
+ "step": 765
1082
+ },
1083
+ {
1084
+ "epoch": 2.0644295302013425,
1085
+ "grad_norm": 1.0362883806228638,
1086
+ "learning_rate": 2.0471857981829878e-05,
1087
+ "loss": 0.2913,
1088
+ "step": 770
1089
+ },
1090
+ {
1091
+ "epoch": 2.0778523489932885,
1092
+ "grad_norm": 1.1844029426574707,
1093
+ "learning_rate": 2.0347769264086916e-05,
1094
+ "loss": 0.3286,
1095
+ "step": 775
1096
+ },
1097
+ {
1098
+ "epoch": 2.091275167785235,
1099
+ "grad_norm": 1.0810260772705078,
1100
+ "learning_rate": 2.0223259846408485e-05,
1101
+ "loss": 0.2885,
1102
+ "step": 780
1103
+ },
1104
+ {
1105
+ "epoch": 2.104697986577181,
1106
+ "grad_norm": 1.1697508096694946,
1107
+ "learning_rate": 2.009833952373925e-05,
1108
+ "loss": 0.2795,
1109
+ "step": 785
1110
+ },
1111
+ {
1112
+ "epoch": 2.1181208053691276,
1113
+ "grad_norm": 1.029407024383545,
1114
+ "learning_rate": 1.9973018123349067e-05,
1115
+ "loss": 0.2621,
1116
+ "step": 790
1117
+ },
1118
+ {
1119
+ "epoch": 2.1315436241610737,
1120
+ "grad_norm": 1.0687460899353027,
1121
+ "learning_rate": 1.984730550405989e-05,
1122
+ "loss": 0.2954,
1123
+ "step": 795
1124
+ },
1125
+ {
1126
+ "epoch": 2.14496644295302,
1127
+ "grad_norm": 1.2070434093475342,
1128
+ "learning_rate": 1.9721211555470197e-05,
1129
+ "loss": 0.2972,
1130
+ "step": 800
1131
+ },
1132
+ {
1133
+ "epoch": 2.1583892617449663,
1134
+ "grad_norm": 1.248508095741272,
1135
+ "learning_rate": 1.9594746197177025e-05,
1136
+ "loss": 0.2557,
1137
+ "step": 805
1138
+ },
1139
+ {
1140
+ "epoch": 2.1718120805369128,
1141
+ "grad_norm": 0.9223589897155762,
1142
+ "learning_rate": 1.9467919377995553e-05,
1143
+ "loss": 0.2297,
1144
+ "step": 810
1145
+ },
1146
+ {
1147
+ "epoch": 2.185234899328859,
1148
+ "grad_norm": 1.2255672216415405,
1149
+ "learning_rate": 1.934074107517647e-05,
1150
+ "loss": 0.2845,
1151
+ "step": 815
1152
+ },
1153
+ {
1154
+ "epoch": 2.1986577181208053,
1155
+ "grad_norm": 1.2953975200653076,
1156
+ "learning_rate": 1.9213221293621117e-05,
1157
+ "loss": 0.2187,
1158
+ "step": 820
1159
+ },
1160
+ {
1161
+ "epoch": 2.212080536912752,
1162
+ "grad_norm": 1.223412275314331,
1163
+ "learning_rate": 1.9085370065094367e-05,
1164
+ "loss": 0.2837,
1165
+ "step": 825
1166
+ },
1167
+ {
1168
+ "epoch": 2.225503355704698,
1169
+ "grad_norm": 1.351789951324463,
1170
+ "learning_rate": 1.8957197447435458e-05,
1171
+ "loss": 0.2387,
1172
+ "step": 830
1173
+ },
1174
+ {
1175
+ "epoch": 2.2389261744966444,
1176
+ "grad_norm": 1.0444451570510864,
1177
+ "learning_rate": 1.8828713523766784e-05,
1178
+ "loss": 0.2384,
1179
+ "step": 835
1180
+ },
1181
+ {
1182
+ "epoch": 2.2523489932885905,
1183
+ "grad_norm": 1.1646713018417358,
1184
+ "learning_rate": 1.8699928401700642e-05,
1185
+ "loss": 0.2669,
1186
+ "step": 840
1187
+ },
1188
+ {
1189
+ "epoch": 2.265771812080537,
1190
+ "grad_norm": 1.4678176641464233,
1191
+ "learning_rate": 1.8570852212544108e-05,
1192
+ "loss": 0.2723,
1193
+ "step": 845
1194
+ },
1195
+ {
1196
+ "epoch": 2.279194630872483,
1197
+ "grad_norm": 1.0317347049713135,
1198
+ "learning_rate": 1.8441495110501986e-05,
1199
+ "loss": 0.2644,
1200
+ "step": 850
1201
+ },
1202
+ {
1203
+ "epoch": 2.2926174496644296,
1204
+ "grad_norm": 1.382323980331421,
1205
+ "learning_rate": 1.8311867271878057e-05,
1206
+ "loss": 0.276,
1207
+ "step": 855
1208
+ },
1209
+ {
1210
+ "epoch": 2.3060402684563757,
1211
+ "grad_norm": 1.1337871551513672,
1212
+ "learning_rate": 1.818197889427446e-05,
1213
+ "loss": 0.2535,
1214
+ "step": 860
1215
+ },
1216
+ {
1217
+ "epoch": 2.319463087248322,
1218
+ "grad_norm": 1.0053396224975586,
1219
+ "learning_rate": 1.805184019578951e-05,
1220
+ "loss": 0.2476,
1221
+ "step": 865
1222
+ },
1223
+ {
1224
+ "epoch": 2.3328859060402687,
1225
+ "grad_norm": 1.1298002004623413,
1226
+ "learning_rate": 1.792146141421383e-05,
1227
+ "loss": 0.265,
1228
+ "step": 870
1229
+ },
1230
+ {
1231
+ "epoch": 2.3463087248322148,
1232
+ "grad_norm": 1.119316577911377,
1233
+ "learning_rate": 1.7790852806224978e-05,
1234
+ "loss": 0.2585,
1235
+ "step": 875
1236
+ },
1237
+ {
1238
+ "epoch": 2.3597315436241613,
1239
+ "grad_norm": 1.1393946409225464,
1240
+ "learning_rate": 1.7660024646580573e-05,
1241
+ "loss": 0.245,
1242
+ "step": 880
1243
+ },
1244
+ {
1245
+ "epoch": 2.3731543624161073,
1246
+ "grad_norm": 1.089766502380371,
1247
+ "learning_rate": 1.7528987227309974e-05,
1248
+ "loss": 0.2598,
1249
+ "step": 885
1250
+ },
1251
+ {
1252
+ "epoch": 2.386577181208054,
1253
+ "grad_norm": 1.2876455783843994,
1254
+ "learning_rate": 1.7397750856904653e-05,
1255
+ "loss": 0.2398,
1256
+ "step": 890
1257
+ },
1258
+ {
1259
+ "epoch": 2.4,
1260
+ "grad_norm": 1.247550129890442,
1261
+ "learning_rate": 1.7266325859507228e-05,
1262
+ "loss": 0.2306,
1263
+ "step": 895
1264
+ },
1265
+ {
1266
+ "epoch": 2.4134228187919464,
1267
+ "grad_norm": 1.1357083320617676,
1268
+ "learning_rate": 1.713472257409928e-05,
1269
+ "loss": 0.2652,
1270
+ "step": 900
1271
+ },
1272
+ {
1273
+ "epoch": 2.4268456375838925,
1274
+ "grad_norm": 1.4018968343734741,
1275
+ "learning_rate": 1.7002951353688e-05,
1276
+ "loss": 0.2237,
1277
+ "step": 905
1278
+ },
1279
+ {
1280
+ "epoch": 2.440268456375839,
1281
+ "grad_norm": 1.1884123086929321,
1282
+ "learning_rate": 1.6871022564491753e-05,
1283
+ "loss": 0.266,
1284
+ "step": 910
1285
+ },
1286
+ {
1287
+ "epoch": 2.453691275167785,
1288
+ "grad_norm": 1.113447904586792,
1289
+ "learning_rate": 1.6738946585124565e-05,
1290
+ "loss": 0.2105,
1291
+ "step": 915
1292
+ },
1293
+ {
1294
+ "epoch": 2.4671140939597316,
1295
+ "grad_norm": 1.12835693359375,
1296
+ "learning_rate": 1.6606733805779663e-05,
1297
+ "loss": 0.2465,
1298
+ "step": 920
1299
+ },
1300
+ {
1301
+ "epoch": 2.4805369127516776,
1302
+ "grad_norm": 1.2548681497573853,
1303
+ "learning_rate": 1.64743946274121e-05,
1304
+ "loss": 0.213,
1305
+ "step": 925
1306
+ },
1307
+ {
1308
+ "epoch": 2.493959731543624,
1309
+ "grad_norm": 1.1149048805236816,
1310
+ "learning_rate": 1.6341939460920524e-05,
1311
+ "loss": 0.2353,
1312
+ "step": 930
1313
+ },
1314
+ {
1315
+ "epoch": 2.5073825503355707,
1316
+ "grad_norm": 1.0424855947494507,
1317
+ "learning_rate": 1.6209378726328168e-05,
1318
+ "loss": 0.2084,
1319
+ "step": 935
1320
+ },
1321
+ {
1322
+ "epoch": 2.5208053691275167,
1323
+ "grad_norm": 1.1017965078353882,
1324
+ "learning_rate": 1.6076722851963135e-05,
1325
+ "loss": 0.2135,
1326
+ "step": 940
1327
+ },
1328
+ {
1329
+ "epoch": 2.5342281879194632,
1330
+ "grad_norm": 1.112675666809082,
1331
+ "learning_rate": 1.5943982273638008e-05,
1332
+ "loss": 0.2233,
1333
+ "step": 945
1334
+ },
1335
+ {
1336
+ "epoch": 2.5476510067114093,
1337
+ "grad_norm": 1.1784453392028809,
1338
+ "learning_rate": 1.581116743382889e-05,
1339
+ "loss": 0.2044,
1340
+ "step": 950
1341
+ },
1342
+ {
1343
+ "epoch": 2.561073825503356,
1344
+ "grad_norm": 1.257094144821167,
1345
+ "learning_rate": 1.5678288780853903e-05,
1346
+ "loss": 0.2434,
1347
+ "step": 955
1348
+ },
1349
+ {
1350
+ "epoch": 2.574496644295302,
1351
+ "grad_norm": 1.1470935344696045,
1352
+ "learning_rate": 1.554535676805125e-05,
1353
+ "loss": 0.2308,
1354
+ "step": 960
1355
+ },
1356
+ {
1357
+ "epoch": 2.5879194630872484,
1358
+ "grad_norm": 1.0138940811157227,
1359
+ "learning_rate": 1.5412381852956858e-05,
1360
+ "loss": 0.2136,
1361
+ "step": 965
1362
+ },
1363
+ {
1364
+ "epoch": 2.6013422818791945,
1365
+ "grad_norm": 1.0910656452178955,
1366
+ "learning_rate": 1.5279374496481708e-05,
1367
+ "loss": 0.1905,
1368
+ "step": 970
1369
+ },
1370
+ {
1371
+ "epoch": 2.614765100671141,
1372
+ "grad_norm": 1.3550100326538086,
1373
+ "learning_rate": 1.5146345162088871e-05,
1374
+ "loss": 0.2043,
1375
+ "step": 975
1376
+ },
1377
+ {
1378
+ "epoch": 2.6281879194630875,
1379
+ "grad_norm": 1.0329605340957642,
1380
+ "learning_rate": 1.5013304314970414e-05,
1381
+ "loss": 0.2153,
1382
+ "step": 980
1383
+ },
1384
+ {
1385
+ "epoch": 2.6416107382550336,
1386
+ "grad_norm": 1.252626657485962,
1387
+ "learning_rate": 1.4880262421224075e-05,
1388
+ "loss": 0.2115,
1389
+ "step": 985
1390
+ },
1391
+ {
1392
+ "epoch": 2.6550335570469796,
1393
+ "grad_norm": 1.0826952457427979,
1394
+ "learning_rate": 1.4747229947029918e-05,
1395
+ "loss": 0.2011,
1396
+ "step": 990
1397
+ },
1398
+ {
1399
+ "epoch": 2.668456375838926,
1400
+ "grad_norm": 1.119947910308838,
1401
+ "learning_rate": 1.4614217357826995e-05,
1402
+ "loss": 0.2054,
1403
+ "step": 995
1404
+ },
1405
+ {
1406
+ "epoch": 2.6818791946308727,
1407
+ "grad_norm": 1.2884759902954102,
1408
+ "learning_rate": 1.4481235117490053e-05,
1409
+ "loss": 0.1993,
1410
+ "step": 1000
1411
+ },
1412
+ {
1413
+ "epoch": 2.6953020134228187,
1414
+ "grad_norm": 1.221529483795166,
1415
+ "learning_rate": 1.434829368750633e-05,
1416
+ "loss": 0.2035,
1417
+ "step": 1005
1418
+ },
1419
+ {
1420
+ "epoch": 2.7087248322147652,
1421
+ "grad_norm": 1.2037923336029053,
1422
+ "learning_rate": 1.4215403526152583e-05,
1423
+ "loss": 0.2078,
1424
+ "step": 1010
1425
+ },
1426
+ {
1427
+ "epoch": 2.7221476510067113,
1428
+ "grad_norm": 1.3824589252471924,
1429
+ "learning_rate": 1.4082575087672363e-05,
1430
+ "loss": 0.2169,
1431
+ "step": 1015
1432
+ },
1433
+ {
1434
+ "epoch": 2.735570469798658,
1435
+ "grad_norm": 1.1602627038955688,
1436
+ "learning_rate": 1.3949818821453573e-05,
1437
+ "loss": 0.2305,
1438
+ "step": 1020
1439
+ },
1440
+ {
1441
+ "epoch": 2.748993288590604,
1442
+ "grad_norm": 1.1708049774169922,
1443
+ "learning_rate": 1.3817145171206455e-05,
1444
+ "loss": 0.2331,
1445
+ "step": 1025
1446
+ },
1447
+ {
1448
+ "epoch": 2.7624161073825504,
1449
+ "grad_norm": 1.0684705972671509,
1450
+ "learning_rate": 1.3684564574141992e-05,
1451
+ "loss": 0.2126,
1452
+ "step": 1030
1453
+ },
1454
+ {
1455
+ "epoch": 2.7758389261744965,
1456
+ "grad_norm": 1.5986040830612183,
1457
+ "learning_rate": 1.3552087460150836e-05,
1458
+ "loss": 0.1915,
1459
+ "step": 1035
1460
+ },
1461
+ {
1462
+ "epoch": 2.789261744966443,
1463
+ "grad_norm": 1.1710542440414429,
1464
+ "learning_rate": 1.3419724250982795e-05,
1465
+ "loss": 0.2027,
1466
+ "step": 1040
1467
+ },
1468
+ {
1469
+ "epoch": 2.8026845637583895,
1470
+ "grad_norm": 1.1092584133148193,
1471
+ "learning_rate": 1.3287485359426974e-05,
1472
+ "loss": 0.213,
1473
+ "step": 1045
1474
+ },
1475
+ {
1476
+ "epoch": 2.8161073825503355,
1477
+ "grad_norm": 1.3413935899734497,
1478
+ "learning_rate": 1.3155381188492633e-05,
1479
+ "loss": 0.196,
1480
+ "step": 1050
1481
+ },
1482
+ {
1483
+ "epoch": 2.8295302013422816,
1484
+ "grad_norm": 1.2237411737442017,
1485
+ "learning_rate": 1.3023422130590778e-05,
1486
+ "loss": 0.1919,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 2.842953020134228,
1491
+ "grad_norm": 1.0657061338424683,
1492
+ "learning_rate": 1.2891618566716627e-05,
1493
+ "loss": 0.2022,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 2.8563758389261746,
1498
+ "grad_norm": 1.2195141315460205,
1499
+ "learning_rate": 1.275998086563294e-05,
1500
+ "loss": 0.2053,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 2.8697986577181207,
1505
+ "grad_norm": 1.0146418809890747,
1506
+ "learning_rate": 1.2628519383054343e-05,
1507
+ "loss": 0.1876,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 2.883221476510067,
1512
+ "grad_norm": 1.0290652513504028,
1513
+ "learning_rate": 1.2497244460832644e-05,
1514
+ "loss": 0.1743,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 2.8966442953020133,
1519
+ "grad_norm": 1.272923231124878,
1520
+ "learning_rate": 1.2366166426143262e-05,
1521
+ "loss": 0.2261,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 2.91006711409396,
1526
+ "grad_norm": 1.1233028173446655,
1527
+ "learning_rate": 1.2235295590672816e-05,
1528
+ "loss": 0.1996,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 2.9234899328859063,
1533
+ "grad_norm": 1.1806379556655884,
1534
+ "learning_rate": 1.2104642249807901e-05,
1535
+ "loss": 0.1859,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 2.9369127516778524,
1540
+ "grad_norm": 1.259519100189209,
1541
+ "learning_rate": 1.1974216681825193e-05,
1542
+ "loss": 0.1911,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 2.9503355704697984,
1547
+ "grad_norm": 1.2275830507278442,
1548
+ "learning_rate": 1.1844029147082853e-05,
1549
+ "loss": 0.2005,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 2.963758389261745,
1554
+ "grad_norm": 1.0096421241760254,
1555
+ "learning_rate": 1.1714089887213382e-05,
1556
+ "loss": 0.2013,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 2.9771812080536915,
1561
+ "grad_norm": 1.4130147695541382,
1562
+ "learning_rate": 1.1584409124317906e-05,
1563
+ "loss": 0.185,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 2.9906040268456375,
1568
+ "grad_norm": 1.0541446208953857,
1569
+ "learning_rate": 1.1454997060162038e-05,
1570
+ "loss": 0.1766,
1571
+ "step": 1115
1572
+ },
1573
+ {
1574
+ "epoch": 3.002684563758389,
1575
+ "grad_norm": 0.870124340057373,
1576
+ "learning_rate": 1.1325863875373313e-05,
1577
+ "loss": 0.1454,
1578
+ "step": 1120
1579
+ },
1580
+ {
1581
+ "epoch": 3.0161073825503357,
1582
+ "grad_norm": 1.0456061363220215,
1583
+ "learning_rate": 1.1197019728640318e-05,
1584
+ "loss": 0.1231,
1585
+ "step": 1125
1586
+ },
1587
+ {
1588
+ "epoch": 3.029530201342282,
1589
+ "grad_norm": 1.1627365350723267,
1590
+ "learning_rate": 1.1068474755913473e-05,
1591
+ "loss": 0.1312,
1592
+ "step": 1130
1593
+ },
1594
+ {
1595
+ "epoch": 3.0429530201342283,
1596
+ "grad_norm": 0.9357373714447021,
1597
+ "learning_rate": 1.0940239069607705e-05,
1598
+ "loss": 0.1362,
1599
+ "step": 1135
1600
+ },
1601
+ {
1602
+ "epoch": 3.0563758389261744,
1603
+ "grad_norm": 1.0598500967025757,
1604
+ "learning_rate": 1.0812322757806915e-05,
1605
+ "loss": 0.1493,
1606
+ "step": 1140
1607
+ },
1608
+ {
1609
+ "epoch": 3.069798657718121,
1610
+ "grad_norm": 0.9672755599021912,
1611
+ "learning_rate": 1.0684735883470333e-05,
1612
+ "loss": 0.1546,
1613
+ "step": 1145
1614
+ },
1615
+ {
1616
+ "epoch": 3.083221476510067,
1617
+ "grad_norm": 0.9811370372772217,
1618
+ "learning_rate": 1.0557488483640914e-05,
1619
+ "loss": 0.1445,
1620
+ "step": 1150
1621
+ },
1622
+ {
1623
+ "epoch": 3.0966442953020135,
1624
+ "grad_norm": 1.0865955352783203,
1625
+ "learning_rate": 1.0430590568655722e-05,
1626
+ "loss": 0.1328,
1627
+ "step": 1155
1628
+ },
1629
+ {
1630
+ "epoch": 3.1100671140939595,
1631
+ "grad_norm": 1.111515760421753,
1632
+ "learning_rate": 1.0304052121358456e-05,
1633
+ "loss": 0.1432,
1634
+ "step": 1160
1635
+ },
1636
+ {
1637
+ "epoch": 3.123489932885906,
1638
+ "grad_norm": 0.8942503929138184,
1639
+ "learning_rate": 1.017788309631408e-05,
1640
+ "loss": 0.13,
1641
+ "step": 1165
1642
+ },
1643
+ {
1644
+ "epoch": 3.1369127516778526,
1645
+ "grad_norm": 1.0307382345199585,
1646
+ "learning_rate": 1.0052093419025745e-05,
1647
+ "loss": 0.1509,
1648
+ "step": 1170
1649
+ },
1650
+ {
1651
+ "epoch": 3.1503355704697986,
1652
+ "grad_norm": 1.046774983406067,
1653
+ "learning_rate": 9.92669298515397e-06,
1654
+ "loss": 0.1354,
1655
+ "step": 1175
1656
+ },
1657
+ {
1658
+ "epoch": 3.163758389261745,
1659
+ "grad_norm": 1.1899687051773071,
1660
+ "learning_rate": 9.80169165973814e-06,
1661
+ "loss": 0.1304,
1662
+ "step": 1180
1663
+ },
1664
+ {
1665
+ "epoch": 3.177181208053691,
1666
+ "grad_norm": 1.101287603378296,
1667
+ "learning_rate": 9.67709927642046e-06,
1668
+ "loss": 0.1205,
1669
+ "step": 1185
1670
+ },
1671
+ {
1672
+ "epoch": 3.1906040268456377,
1673
+ "grad_norm": 1.0315608978271484,
1674
+ "learning_rate": 9.552925636672352e-06,
1675
+ "loss": 0.1315,
1676
+ "step": 1190
1677
+ },
1678
+ {
1679
+ "epoch": 3.2040268456375838,
1680
+ "grad_norm": 0.9654178619384766,
1681
+ "learning_rate": 9.429180509023402e-06,
1682
+ "loss": 0.1424,
1683
+ "step": 1195
1684
+ },
1685
+ {
1686
+ "epoch": 3.2174496644295303,
1687
+ "grad_norm": 1.0191446542739868,
1688
+ "learning_rate": 9.305873628292856e-06,
1689
+ "loss": 0.1349,
1690
+ "step": 1200
1691
+ },
1692
+ {
1693
+ "epoch": 3.2308724832214764,
1694
+ "grad_norm": 1.1478456258773804,
1695
+ "learning_rate": 9.18301469482383e-06,
1696
+ "loss": 0.1304,
1697
+ "step": 1205
1698
+ },
1699
+ {
1700
+ "epoch": 3.244295302013423,
1701
+ "grad_norm": 1.1973252296447754,
1702
+ "learning_rate": 9.060613373720198e-06,
1703
+ "loss": 0.1393,
1704
+ "step": 1210
1705
+ },
1706
+ {
1707
+ "epoch": 3.257718120805369,
1708
+ "grad_norm": 0.9078258275985718,
1709
+ "learning_rate": 8.938679294086226e-06,
1710
+ "loss": 0.1365,
1711
+ "step": 1215
1712
+ },
1713
+ {
1714
+ "epoch": 3.2711409395973154,
1715
+ "grad_norm": 1.0786795616149902,
1716
+ "learning_rate": 8.817222048269104e-06,
1717
+ "loss": 0.1384,
1718
+ "step": 1220
1719
+ },
1720
+ {
1721
+ "epoch": 3.284563758389262,
1722
+ "grad_norm": 1.0708985328674316,
1723
+ "learning_rate": 8.696251191104302e-06,
1724
+ "loss": 0.1386,
1725
+ "step": 1225
1726
+ },
1727
+ {
1728
+ "epoch": 3.297986577181208,
1729
+ "grad_norm": 0.9283603429794312,
1730
+ "learning_rate": 8.575776239163931e-06,
1731
+ "loss": 0.139,
1732
+ "step": 1230
1733
+ },
1734
+ {
1735
+ "epoch": 3.3114093959731545,
1736
+ "grad_norm": 0.994650661945343,
1737
+ "learning_rate": 8.455806670008073e-06,
1738
+ "loss": 0.1164,
1739
+ "step": 1235
1740
+ },
1741
+ {
1742
+ "epoch": 3.3248322147651006,
1743
+ "grad_norm": 1.165156602859497,
1744
+ "learning_rate": 8.33635192143919e-06,
1745
+ "loss": 0.1215,
1746
+ "step": 1240
1747
+ },
1748
+ {
1749
+ "epoch": 3.338255033557047,
1750
+ "grad_norm": 0.9502781629562378,
1751
+ "learning_rate": 8.217421390759717e-06,
1752
+ "loss": 0.1258,
1753
+ "step": 1245
1754
+ },
1755
+ {
1756
+ "epoch": 3.351677852348993,
1757
+ "grad_norm": 0.9705508947372437,
1758
+ "learning_rate": 8.099024434032719e-06,
1759
+ "loss": 0.1316,
1760
+ "step": 1250
1761
+ },
1762
+ {
1763
+ "epoch": 3.3651006711409397,
1764
+ "grad_norm": 1.0465086698532104,
1765
+ "learning_rate": 7.981170365345924e-06,
1766
+ "loss": 0.1431,
1767
+ "step": 1255
1768
+ },
1769
+ {
1770
+ "epoch": 3.3785234899328858,
1771
+ "grad_norm": 0.9740952253341675,
1772
+ "learning_rate": 7.863868456078987e-06,
1773
+ "loss": 0.1305,
1774
+ "step": 1260
1775
+ },
1776
+ {
1777
+ "epoch": 3.3919463087248323,
1778
+ "grad_norm": 0.9972068667411804,
1779
+ "learning_rate": 7.747127934174094e-06,
1780
+ "loss": 0.1353,
1781
+ "step": 1265
1782
+ },
1783
+ {
1784
+ "epoch": 3.4053691275167783,
1785
+ "grad_norm": 0.981585681438446,
1786
+ "learning_rate": 7.63095798341005e-06,
1787
+ "loss": 0.1282,
1788
+ "step": 1270
1789
+ },
1790
+ {
1791
+ "epoch": 3.418791946308725,
1792
+ "grad_norm": 0.977643609046936,
1793
+ "learning_rate": 7.515367742679809e-06,
1794
+ "loss": 0.1341,
1795
+ "step": 1275
1796
+ },
1797
+ {
1798
+ "epoch": 3.432214765100671,
1799
+ "grad_norm": 1.132516622543335,
1800
+ "learning_rate": 7.4003663052714965e-06,
1801
+ "loss": 0.1208,
1802
+ "step": 1280
1803
+ },
1804
+ {
1805
+ "epoch": 3.4456375838926174,
1806
+ "grad_norm": 0.925231397151947,
1807
+ "learning_rate": 7.285962718153099e-06,
1808
+ "loss": 0.1227,
1809
+ "step": 1285
1810
+ },
1811
+ {
1812
+ "epoch": 3.459060402684564,
1813
+ "grad_norm": 0.8907437324523926,
1814
+ "learning_rate": 7.172165981260735e-06,
1815
+ "loss": 0.1314,
1816
+ "step": 1290
1817
+ },
1818
+ {
1819
+ "epoch": 3.47248322147651,
1820
+ "grad_norm": 1.0845379829406738,
1821
+ "learning_rate": 7.058985046790626e-06,
1822
+ "loss": 0.1353,
1823
+ "step": 1295
1824
+ },
1825
+ {
1826
+ "epoch": 3.4859060402684565,
1827
+ "grad_norm": 0.8987553715705872,
1828
+ "learning_rate": 6.946428818494886e-06,
1829
+ "loss": 0.1322,
1830
+ "step": 1300
1831
+ },
1832
+ {
1833
+ "epoch": 3.4993288590604026,
1834
+ "grad_norm": 0.7804974317550659,
1835
+ "learning_rate": 6.834506150981038e-06,
1836
+ "loss": 0.1177,
1837
+ "step": 1305
1838
+ },
1839
+ {
1840
+ "epoch": 3.512751677852349,
1841
+ "grad_norm": 0.9216485619544983,
1842
+ "learning_rate": 6.7232258490154485e-06,
1843
+ "loss": 0.1173,
1844
+ "step": 1310
1845
+ },
1846
+ {
1847
+ "epoch": 3.526174496644295,
1848
+ "grad_norm": 0.9866665601730347,
1849
+ "learning_rate": 6.6125966668307e-06,
1850
+ "loss": 0.1079,
1851
+ "step": 1315
1852
+ },
1853
+ {
1854
+ "epoch": 3.5395973154362417,
1855
+ "grad_norm": 1.0669224262237549,
1856
+ "learning_rate": 6.5026273074368575e-06,
1857
+ "loss": 0.1175,
1858
+ "step": 1320
1859
+ },
1860
+ {
1861
+ "epoch": 3.5530201342281877,
1862
+ "grad_norm": 1.1475476026535034,
1863
+ "learning_rate": 6.393326421936868e-06,
1864
+ "loss": 0.1214,
1865
+ "step": 1325
1866
+ },
1867
+ {
1868
+ "epoch": 3.5664429530201343,
1869
+ "grad_norm": 0.9193068742752075,
1870
+ "learning_rate": 6.284702608845968e-06,
1871
+ "loss": 0.1339,
1872
+ "step": 1330
1873
+ },
1874
+ {
1875
+ "epoch": 3.5798657718120808,
1876
+ "grad_norm": 0.9250863790512085,
1877
+ "learning_rate": 6.176764413415242e-06,
1878
+ "loss": 0.1132,
1879
+ "step": 1335
1880
+ },
1881
+ {
1882
+ "epoch": 3.593288590604027,
1883
+ "grad_norm": 1.0149400234222412,
1884
+ "learning_rate": 6.069520326959417e-06,
1885
+ "loss": 0.1115,
1886
+ "step": 1340
1887
+ },
1888
+ {
1889
+ "epoch": 3.606711409395973,
1890
+ "grad_norm": 0.8474932909011841,
1891
+ "learning_rate": 5.9629787861888285e-06,
1892
+ "loss": 0.1115,
1893
+ "step": 1345
1894
+ },
1895
+ {
1896
+ "epoch": 3.6201342281879194,
1897
+ "grad_norm": 0.8875136375427246,
1898
+ "learning_rate": 5.857148172545734e-06,
1899
+ "loss": 0.1109,
1900
+ "step": 1350
1901
+ },
1902
+ {
1903
+ "epoch": 3.633557046979866,
1904
+ "grad_norm": 1.1870776414871216,
1905
+ "learning_rate": 5.752036811544973e-06,
1906
+ "loss": 0.1066,
1907
+ "step": 1355
1908
+ },
1909
+ {
1910
+ "epoch": 3.646979865771812,
1911
+ "grad_norm": 0.9612839221954346,
1912
+ "learning_rate": 5.647652972118998e-06,
1913
+ "loss": 0.1284,
1914
+ "step": 1360
1915
+ },
1916
+ {
1917
+ "epoch": 3.6604026845637585,
1918
+ "grad_norm": 1.1399682760238647,
1919
+ "learning_rate": 5.544004865967358e-06,
1920
+ "loss": 0.1035,
1921
+ "step": 1365
1922
+ },
1923
+ {
1924
+ "epoch": 3.6738255033557046,
1925
+ "grad_norm": 1.0047740936279297,
1926
+ "learning_rate": 5.441100646910733e-06,
1927
+ "loss": 0.0993,
1928
+ "step": 1370
1929
+ },
1930
+ {
1931
+ "epoch": 3.687248322147651,
1932
+ "grad_norm": 0.9941322803497314,
1933
+ "learning_rate": 5.338948410249454e-06,
1934
+ "loss": 0.1305,
1935
+ "step": 1375
1936
+ },
1937
+ {
1938
+ "epoch": 3.7006711409395976,
1939
+ "grad_norm": 1.0649505853652954,
1940
+ "learning_rate": 5.237556192126671e-06,
1941
+ "loss": 0.1208,
1942
+ "step": 1380
1943
+ },
1944
+ {
1945
+ "epoch": 3.7140939597315437,
1946
+ "grad_norm": 1.0379612445831299,
1947
+ "learning_rate": 5.1369319688961835e-06,
1948
+ "loss": 0.1111,
1949
+ "step": 1385
1950
+ },
1951
+ {
1952
+ "epoch": 3.7275167785234897,
1953
+ "grad_norm": 0.9948301911354065,
1954
+ "learning_rate": 5.03708365649491e-06,
1955
+ "loss": 0.1225,
1956
+ "step": 1390
1957
+ },
1958
+ {
1959
+ "epoch": 3.7409395973154362,
1960
+ "grad_norm": 0.790232241153717,
1961
+ "learning_rate": 4.9380191098202e-06,
1962
+ "loss": 0.1241,
1963
+ "step": 1395
1964
+ },
1965
+ {
1966
+ "epoch": 3.7543624161073827,
1967
+ "grad_norm": 0.8969043493270874,
1968
+ "learning_rate": 4.839746122111883e-06,
1969
+ "loss": 0.0969,
1970
+ "step": 1400
1971
+ },
1972
+ {
1973
+ "epoch": 3.767785234899329,
1974
+ "grad_norm": 0.9660268425941467,
1975
+ "learning_rate": 4.742272424339168e-06,
1976
+ "loss": 0.1105,
1977
+ "step": 1405
1978
+ },
1979
+ {
1980
+ "epoch": 3.7812080536912753,
1981
+ "grad_norm": 0.7990521192550659,
1982
+ "learning_rate": 4.645605684592505e-06,
1983
+ "loss": 0.108,
1984
+ "step": 1410
1985
+ },
1986
+ {
1987
+ "epoch": 3.7946308724832214,
1988
+ "grad_norm": 0.9827368855476379,
1989
+ "learning_rate": 4.54975350748031e-06,
1990
+ "loss": 0.1204,
1991
+ "step": 1415
1992
+ },
1993
+ {
1994
+ "epoch": 3.808053691275168,
1995
+ "grad_norm": 0.8969358801841736,
1996
+ "learning_rate": 4.454723433530736e-06,
1997
+ "loss": 0.1069,
1998
+ "step": 1420
1999
+ },
2000
+ {
2001
+ "epoch": 3.821476510067114,
2002
+ "grad_norm": 1.0632431507110596,
2003
+ "learning_rate": 4.3605229385984915e-06,
2004
+ "loss": 0.0896,
2005
+ "step": 1425
2006
+ },
2007
+ {
2008
+ "epoch": 3.8348993288590605,
2009
+ "grad_norm": 0.8857918977737427,
2010
+ "learning_rate": 4.26715943327669e-06,
2011
+ "loss": 0.109,
2012
+ "step": 1430
2013
+ },
2014
+ {
2015
+ "epoch": 3.8483221476510066,
2016
+ "grad_norm": 0.9165077209472656,
2017
+ "learning_rate": 4.174640262313912e-06,
2018
+ "loss": 0.1261,
2019
+ "step": 1435
2020
+ },
2021
+ {
2022
+ "epoch": 3.861744966442953,
2023
+ "grad_norm": 0.932483971118927,
2024
+ "learning_rate": 4.082972704036376e-06,
2025
+ "loss": 0.1228,
2026
+ "step": 1440
2027
+ },
2028
+ {
2029
+ "epoch": 3.8751677852348996,
2030
+ "grad_norm": 0.8068802952766418,
2031
+ "learning_rate": 3.992163969775376e-06,
2032
+ "loss": 0.0906,
2033
+ "step": 1445
2034
+ },
2035
+ {
2036
+ "epoch": 3.8885906040268456,
2037
+ "grad_norm": 0.9022645950317383,
2038
+ "learning_rate": 3.902221203299974e-06,
2039
+ "loss": 0.1065,
2040
+ "step": 1450
2041
+ },
2042
+ {
2043
+ "epoch": 3.9020134228187917,
2044
+ "grad_norm": 1.0112144947052002,
2045
+ "learning_rate": 3.813151480255026e-06,
2046
+ "loss": 0.1098,
2047
+ "step": 1455
2048
+ },
2049
+ {
2050
+ "epoch": 3.915436241610738,
2051
+ "grad_norm": 0.8892555832862854,
2052
+ "learning_rate": 3.7249618076045316e-06,
2053
+ "loss": 0.107,
2054
+ "step": 1460
2055
+ },
2056
+ {
2057
+ "epoch": 3.9288590604026847,
2058
+ "grad_norm": 0.8218262195587158,
2059
+ "learning_rate": 3.6376591230804245e-06,
2060
+ "loss": 0.0931,
2061
+ "step": 1465
2062
+ },
2063
+ {
2064
+ "epoch": 3.942281879194631,
2065
+ "grad_norm": 0.9167711138725281,
2066
+ "learning_rate": 3.5512502946367935e-06,
2067
+ "loss": 0.105,
2068
+ "step": 1470
2069
+ },
2070
+ {
2071
+ "epoch": 3.9557046979865773,
2072
+ "grad_norm": 0.9390382170677185,
2073
+ "learning_rate": 3.465742119909568e-06,
2074
+ "loss": 0.1207,
2075
+ "step": 1475
2076
+ },
2077
+ {
2078
+ "epoch": 3.9691275167785234,
2079
+ "grad_norm": 0.9876239895820618,
2080
+ "learning_rate": 3.3811413256817997e-06,
2081
+ "loss": 0.1085,
2082
+ "step": 1480
2083
+ },
2084
+ {
2085
+ "epoch": 3.98255033557047,
2086
+ "grad_norm": 0.8933213353157043,
2087
+ "learning_rate": 3.297454567354436e-06,
2088
+ "loss": 0.1,
2089
+ "step": 1485
2090
+ },
2091
+ {
2092
+ "epoch": 3.995973154362416,
2093
+ "grad_norm": 0.8091711401939392,
2094
+ "learning_rate": 3.214688428422779e-06,
2095
+ "loss": 0.1149,
2096
+ "step": 1490
2097
+ }
2098
+ ],
2099
+ "logging_steps": 5,
2100
+ "max_steps": 1865,
2101
+ "num_input_tokens_seen": 0,
2102
+ "num_train_epochs": 5,
2103
+ "save_steps": 2000,
2104
+ "stateful_callbacks": {
2105
+ "TrainerControl": {
2106
+ "args": {
2107
+ "should_epoch_stop": false,
2108
+ "should_evaluate": false,
2109
+ "should_log": false,
2110
+ "should_save": true,
2111
+ "should_training_stop": false
2112
+ },
2113
+ "attributes": {}
2114
+ }
2115
+ },
2116
+ "total_flos": 1.9850279367632486e+18,
2117
+ "train_batch_size": 2,
2118
+ "trial_name": null,
2119
+ "trial_params": null
2120
+ }
15_128_e5_3e-5/checkpoint-1492/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:074596f08a98a2a8988623657ab3d8064774b9b44f8ab9eecb22b494b6b7d866
+ size 7736
15_128_e5_3e-5/checkpoint-1492/vocab.json ADDED
The diff for this file is too large to render. See raw diff
15_128_e5_3e-5/checkpoint-1492/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from a zero 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
25
+ # DeepSpeed data structures it has to be available in the current python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict()
35
+ param_shapes: dict()
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict()
39
+ frozen_param_fragments: dict()
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
+     if not os.path.isdir(checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+
+     # there should be only one file
+     if zero_stage <= 2:
+         file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+     elif zero_stage == 3:
+         file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+
+     if not os.path.exists(file):
+         raise FileNotFoundError(f"can't find model states file at '{file}'")
+
+     return file
+
+
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
+     # XXX: need to test that this simple glob rule works for multi-node setup too
+     ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+
+     if len(ckpt_files) == 0:
+         raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+
+     return ckpt_files
+
+
+ def get_optim_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+
+
+ def get_model_state_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+
+
+ def parse_model_states(files):
+     zero_model_states = []
+     for file in files:
+         state_dict = torch.load(file, map_location=device)
+
+         if BUFFER_NAMES not in state_dict:
+             raise ValueError(f"{file} is not a model state checkpoint")
+         buffer_names = state_dict[BUFFER_NAMES]
+         if debug:
+             print("Found buffers:", buffer_names)
+
+         # recover just the buffers while restoring them to fp32 if they were saved in fp16
+         buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+         param_shapes = state_dict[PARAM_SHAPES]
+
+         # collect parameters that are included in param_shapes
+         param_names = []
+         for s in param_shapes:
+             for name in s.keys():
+                 param_names.append(name)
+
+         # update with frozen parameters
+         frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+         if frozen_param_shapes is not None:
+             if debug:
+                 print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+             param_names += list(frozen_param_shapes.keys())
+
+         # handle shared params
+         shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+
+         ds_version = state_dict.get(DS_VERSION, None)
+
+         frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+
+         z_model_state = zero_model_state(buffers=buffers,
+                                          param_shapes=param_shapes,
+                                          shared_params=shared_params,
+                                          ds_version=ds_version,
+                                          frozen_param_shapes=frozen_param_shapes,
+                                          frozen_param_fragments=frozen_param_fragments)
+         zero_model_states.append(z_model_state)
+
+     return zero_model_states
+
+
+ def parse_optim_states(files, ds_checkpoint_dir):
+
+     total_files = len(files)
+     state_dicts = []
+     for f in files:
+         state_dict = torch.load(f, map_location=device)
+         # immediately discard the two potentially huge optimizer states as we only care for fp32 master weights
+         # and also handle the case where it was already removed by another helper script
+         state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+         state_dicts.append(state_dict)
+
+     if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+         raise ValueError(f"{files[0]} is not a zero checkpoint")
+     zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+     world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+
+     # For ZeRO-2 each param group can have a different partition_count, as data parallelism for expert
+     # parameters can be different from data parallelism for non-expert parameters. So we can just
+     # use the max of the partition_count to get the dp world_size.
+
+     if type(world_size) is list:
+         world_size = max(world_size)
+
+     if world_size != total_files:
+         raise ValueError(
+             f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+             "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+         )
+
+     # the groups are named differently in each stage
+     if zero_stage <= 2:
+         fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+     elif zero_stage == 3:
+         fp32_groups_key = FP32_FLAT_GROUPS
+     else:
+         raise ValueError(f"unknown zero stage {zero_stage}")
+
+     if zero_stage <= 2:
+         fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+     elif zero_stage == 3:
+         # if there is more than one param group, there will be multiple flattened tensors - one
+         # flattened tensor per group - for simplicity merge them into a single tensor
+         #
+         # XXX: could make the script more memory efficient for when there are multiple groups - it
+         # will require matching the sub-lists of param_shapes for each param group flattened tensor
+
+         fp32_flat_groups = [
+             torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
+         ]
+
+     return zero_stage, world_size, fp32_flat_groups
+
+
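+ # e.g. (illustrative, not part of the original script): a PARTITION_COUNT saved as [8, 8]
+ # collapses to world_size == 8 above, which must then match the number of '*_optim_states.pt'
+ # shards found on disk.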
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
+     """
+     Returns fp32 state_dict reconstructed from ds checkpoint
+
+     Args:
+         - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+
+     """
+     print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+
+     optim_files = get_optim_files(ds_checkpoint_dir)
+     zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+     print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+
+     model_files = get_model_state_files(ds_checkpoint_dir)
+
+     zero_model_states = parse_model_states(model_files)
+     print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+
+     if zero_stage <= 2:
+         return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                           exclude_frozen_parameters)
+     elif zero_stage == 3:
+         return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                           exclude_frozen_parameters)
+
+
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+
+     if debug:
+         num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+         print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         state_dict[name] = frozen_param_fragments[name]
+
+         if debug:
+             print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _has_callable(obj, fn):
+     attr = getattr(obj, fn, None)
+     return callable(attr)
+
+
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+
+     # Reconstruction protocol:
+     #
+     # XXX: document this
+
+     if debug:
+         for i in range(world_size):
+             for j in range(len(fp32_flat_groups[0])):
+                 print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+
+     # XXX: memory usage doubles here (zero2)
+     num_param_groups = len(fp32_flat_groups[0])
+     merged_single_partition_of_fp32_groups = []
+     for i in range(num_param_groups):
+         merged_partitions = [sd[i] for sd in fp32_flat_groups]
+         full_single_fp32_vector = torch.cat(merged_partitions, 0)
+         merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+     avail_numel = sum(
+         [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+
+     if debug:
+         wanted_params = sum([len(shapes) for shapes in param_shapes])
+         wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+         # not asserting if there is a mismatch due to possible padding
+         print(f"Have {avail_numel} numels to process.")
+         print(f"Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     total_numel = 0
+     total_params = 0
+     for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+         offset = 0
+         avail_numel = full_single_fp32_vector.numel()
+         for name, shape in shapes.items():
+
+             unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
+             total_numel += unpartitioned_numel
+             total_params += 1
+
+             if debug:
+                 print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+             state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+             offset += unpartitioned_numel
+
+         # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+         # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+         # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+         # live optimizer object, so we are checking that the numbers are within the right range
+         align_to = 2 * world_size
+
+         def zero2_align(x):
+             return align_to * math.ceil(x / align_to)
+
+         if debug:
+             print(f"original offset={offset}, avail_numel={avail_numel}")
+
+         offset = zero2_align(offset)
+         avail_numel = zero2_align(avail_numel)
+
+         if debug:
+             print(f"aligned offset={offset}, avail_numel={avail_numel}")
+
+         # Sanity check
+         if offset != avail_numel:
+             raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+
+
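+ # Worked example of the ZeRO-2 alignment check above (added for illustration): with
+ # world_size == 8, align_to == 16, so a group that consumed offset == 100 numels is padded
+ # up to zero2_align(100) == 16 * ceil(100 / 16) == 112, and avail_numel must land on the
+ # same multiple for the sanity check to pass.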
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                exclude_frozen_parameters):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     if not exclude_frozen_parameters:
+         _zero2_merge_frozen_params(state_dict, zero_model_states)
+
+     _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+     remainder = unpartitioned_numel % world_size
+     padding_numel = (world_size - remainder) if remainder else 0
+     partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+     return partitioned_numel, padding_numel
+
+
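+ # e.g. zero3_partitioned_param_info(10, 4) == (3, 2): each of the 4 ranks holds
+ # ceil(10 / 4) == 3 numels, of which 3 * 4 - 10 == 2 are padding (illustration only).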
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     if debug:
+         for i in range(world_size):
+             num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+             print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in zero_model_states[0].frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+         state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+     # param, re-consolidating each param, while dealing with padding if any
+
+     # merge list of dicts, preserving order
+     param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+
+     if debug:
+         for i in range(world_size):
+             print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+
+     wanted_params = len(param_shapes)
+     wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+     # not asserting if there is a mismatch due to possible padding
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     print(f"Trainable params: Have {avail_numel} numels to process.")
+     print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     offset = 0
+     total_numel = 0
+     total_params = 0
+     for name, shape in param_shapes.items():
+
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+         total_params += 1
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+         # XXX: memory usage doubles here
+         state_dict[name] = torch.cat(
+             tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+             0).narrow(0, 0, unpartitioned_numel).view(shape)
+         offset += partitioned_numel
+
+     offset *= world_size
+
+     # Sanity check
+     if offset != avail_numel:
+         raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                exclude_frozen_parameters):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     if not exclude_frozen_parameters:
+         _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+
+     _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
+     """
+     Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+     ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+     via a model hub.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
+         - ``exclude_frozen_parameters``: exclude frozen parameters
+
+     Returns:
+         - pytorch ``state_dict``
+
+     Note: this approach may not work if your application doesn't have sufficient free CPU memory and
+     you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
+     the checkpoint.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+         # do the training and checkpoint saving
+         state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+         model = model.cpu() # move to cpu
+         model.load_state_dict(state_dict)
+         # submit to model hub or save the model to share with others
+
+     In this example the ``model`` will no longer be usable in the deepspeed context of the same
+     application. i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+
+     """
+     if tag is None:
+         latest_path = os.path.join(checkpoint_dir, 'latest')
+         if os.path.isfile(latest_path):
+             with open(latest_path, 'r') as fd:
+                 tag = fd.read().strip()
+         else:
+             raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+
+     ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+
+     if not os.path.isdir(ds_checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+
+     return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
+
+
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
+     """
+     Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+     loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+         - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+         - ``exclude_frozen_parameters``: exclude frozen parameters
+     """
+
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
+     print(f"Saving fp32 state dict to {output_file}")
+     torch.save(state_dict, output_file)
+
+
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+     """
+     1. Put the provided model to cpu
+     2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+     3. Load it into the provided model
+
+     Args:
+         - ``model``: the model object to update
+         - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+
+     Returns:
+         - ``model``: modified model
+
+     Make sure you have plenty of CPU memory available before you call this function. If you don't
+     have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+     conveniently placed for you in the checkpoint folder.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+         model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+         # submit to model hub or save the model to share with others
+
+     Note that once this is run, the ``model`` will no longer be usable in the deepspeed context
+     of the same application. i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     """
+     logger.info("Extracting fp32 weights")
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+
+     logger.info("Overwriting model with fp32 weights")
+     model = model.cpu()
+     model.load_state_dict(state_dict, strict=False)
+
+     return model
+
+
+ if __name__ == "__main__":
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("checkpoint_dir",
+                         type=str,
+                         help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+     parser.add_argument(
+         "output_file",
+         type=str,
+         help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
+     parser.add_argument("-t",
+                         "--tag",
+                         type=str,
+                         default=None,
+                         help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+     parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
+     parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+     args = parser.parse_args()
+
+     debug = args.debug
+
+     convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
+                                                args.output_file,
+                                                tag=args.tag,
+                                                exclude_frozen_parameters=args.exclude_frozen_parameters)
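
Taken together, the helpers above mean any checkpoint folder in this upload can be consolidated into a single fp32 file without a live DeepSpeed engine. A minimal sketch, assuming the repository has been cloned locally with its LFS objects and the snippet is run from the repo root (the output filename is arbitrary):

```python
# Consolidate the DeepSpeed shards of one checkpoint into a single fp32 state_dict file.
# The tag (global_step1865) is read automatically from the folder's 'latest' file.
import sys

sys.path.insert(0, "15_128_e5_3e-5/checkpoint-1865")  # the script ships inside each checkpoint dir
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "15_128_e5_3e-5/checkpoint-1865",
    "15_128_e5_3e-5/checkpoint-1865/pytorch_model.bin",
)
```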
15_128_e5_3e-5/checkpoint-1865/README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: ibm-granite/granite-3.3-8b-base
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.2
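
Since the card's metadata names `ibm-granite/granite-3.3-8b-base` as the base model and `peft` as the library, getting started would follow the standard PEFT adapter-loading pattern. A minimal sketch, assuming the checkpoint folder has been downloaded locally and enough memory is available for the 8B base model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ibm-granite/granite-3.3-8b-base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the LoRA adapter saved in this checkpoint folder (local path is illustrative).
model = PeftModel.from_pretrained(base, "15_128_e5_3e-5/checkpoint-1865")
```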
15_128_e5_3e-5/checkpoint-1865/adapter_config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+   "bias": "none",
+   "corda_config": null,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 256,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 128,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "gate_proj",
+     "down_proj",
+     "up_proj",
+     "q_proj",
+     "v_proj",
+     "o_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_rslora": false
+ }
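
This config wraps all seven attention and MLP projections of the base model with rank-128 LoRA adapters. With `use_rslora` false, PEFT applies the update at the standard scaling of `lora_alpha / r`, as the sketch below works out:

```python
# LoRA scaling implied by this adapter_config.json: delta_W = (lora_alpha / r) * B @ A
lora_alpha, r = 256, 128
scaling = lora_alpha / r  # 2.0
```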
15_128_e5_3e-5/checkpoint-1865/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62230b058aa940980d599d037bd2d0bff0e7aa1fa97c82fbb34001704277b2d6
+ size 791751704
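The three lines above are a Git LFS pointer rather than the tensor data itself: the actual `adapter_model.safetensors` (791,751,704 bytes, identified by the sha256 oid) lives in LFS storage and is materialized on clone or with `git lfs pull`. The small `rng_state_*.pth` files below are stored the same way.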
15_128_e5_3e-5/checkpoint-1865/latest ADDED
@@ -0,0 +1 @@
+ global_step1865
15_128_e5_3e-5/checkpoint-1865/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
15_128_e5_3e-5/checkpoint-1865/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1253e24abda8c2e351796f8b37125e036c77bb25e564afe1fe7f04980d0f553a
+ size 15920
15_128_e5_3e-5/checkpoint-1865/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:59ff1c4d189a1a562289a0a6c72b9c56c4351238c77b6a073b3793074f8f78c2
+ size 15920
15_128_e5_3e-5/checkpoint-1865/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5eb69ff64b50e05ce3183528880b93c291dbcac9784bd036ad1b17222146d3f1
+ size 15920