RayDu0010 committed
Commit 4659434 · verified · 1 Parent(s): 9e7137c

Upload folder using huggingface_hub

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.
Files changed (50)
  1. 20_128_e5_3e-5/checkpoint-1324/README.md +202 -0
  2. 20_128_e5_3e-5/checkpoint-1324/adapter_config.json +39 -0
  3. 20_128_e5_3e-5/checkpoint-1324/adapter_model.safetensors +3 -0
  4. 20_128_e5_3e-5/checkpoint-1324/latest +1 -0
  5. 20_128_e5_3e-5/checkpoint-1324/merges.txt +0 -0
  6. 20_128_e5_3e-5/checkpoint-1324/rng_state_0.pth +3 -0
  7. 20_128_e5_3e-5/checkpoint-1324/rng_state_1.pth +3 -0
  8. 20_128_e5_3e-5/checkpoint-1324/rng_state_2.pth +3 -0
  9. 20_128_e5_3e-5/checkpoint-1324/rng_state_3.pth +3 -0
  10. 20_128_e5_3e-5/checkpoint-1324/rng_state_4.pth +3 -0
  11. 20_128_e5_3e-5/checkpoint-1324/rng_state_5.pth +3 -0
  12. 20_128_e5_3e-5/checkpoint-1324/rng_state_6.pth +3 -0
  13. 20_128_e5_3e-5/checkpoint-1324/rng_state_7.pth +3 -0
  14. 20_128_e5_3e-5/checkpoint-1324/scheduler.pt +3 -0
  15. 20_128_e5_3e-5/checkpoint-1324/special_tokens_map.json +45 -0
  16. 20_128_e5_3e-5/checkpoint-1324/tokenizer.json +0 -0
  17. 20_128_e5_3e-5/checkpoint-1324/tokenizer_config.json +188 -0
  18. 20_128_e5_3e-5/checkpoint-1324/trainer_state.json +1882 -0
  19. 20_128_e5_3e-5/checkpoint-1324/training_args.bin +3 -0
  20. 20_128_e5_3e-5/checkpoint-1324/vocab.json +0 -0
  21. 20_128_e5_3e-5/checkpoint-1324/zero_to_fp32.py +604 -0
  22. 20_128_e5_3e-5/checkpoint-1655/README.md +202 -0
  23. 20_128_e5_3e-5/checkpoint-1655/adapter_config.json +39 -0
  24. 20_128_e5_3e-5/checkpoint-1655/adapter_model.safetensors +3 -0
  25. 20_128_e5_3e-5/checkpoint-1655/latest +1 -0
  26. 20_128_e5_3e-5/checkpoint-1655/merges.txt +0 -0
  27. 20_128_e5_3e-5/checkpoint-1655/rng_state_0.pth +3 -0
  28. 20_128_e5_3e-5/checkpoint-1655/rng_state_1.pth +3 -0
  29. 20_128_e5_3e-5/checkpoint-1655/rng_state_2.pth +3 -0
  30. 20_128_e5_3e-5/checkpoint-1655/rng_state_3.pth +3 -0
  31. 20_128_e5_3e-5/checkpoint-1655/rng_state_4.pth +3 -0
  32. 20_128_e5_3e-5/checkpoint-1655/rng_state_5.pth +3 -0
  33. 20_128_e5_3e-5/checkpoint-1655/rng_state_6.pth +3 -0
  34. 20_128_e5_3e-5/checkpoint-1655/rng_state_7.pth +3 -0
  35. 20_128_e5_3e-5/checkpoint-1655/scheduler.pt +3 -0
  36. 20_128_e5_3e-5/checkpoint-1655/special_tokens_map.json +45 -0
  37. 20_128_e5_3e-5/checkpoint-1655/tokenizer.json +0 -0
  38. 20_128_e5_3e-5/checkpoint-1655/tokenizer_config.json +188 -0
  39. 20_128_e5_3e-5/checkpoint-1655/trainer_state.json +2351 -0
  40. 20_128_e5_3e-5/checkpoint-1655/training_args.bin +3 -0
  41. 20_128_e5_3e-5/checkpoint-1655/vocab.json +0 -0
  42. 20_128_e5_3e-5/checkpoint-1655/zero_to_fp32.py +604 -0
  43. 20_128_e5_3e-5/checkpoint-331/README.md +202 -0
  44. 20_128_e5_3e-5/checkpoint-331/adapter_config.json +39 -0
  45. 20_128_e5_3e-5/checkpoint-331/adapter_model.safetensors +3 -0
  46. 20_128_e5_3e-5/checkpoint-331/latest +1 -0
  47. 20_128_e5_3e-5/checkpoint-331/merges.txt +0 -0
  48. 20_128_e5_3e-5/checkpoint-331/rng_state_0.pth +3 -0
  49. 20_128_e5_3e-5/checkpoint-331/rng_state_1.pth +3 -0
  50. 20_128_e5_3e-5/checkpoint-331/rng_state_2.pth +3 -0
20_128_e5_3e-5/checkpoint-1324/README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: ibm-granite/granite-3.3-8b-base
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.2
20_128_e5_3e-5/checkpoint-1324/adapter_config.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": true,
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 256,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 128,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "up_proj",
+ "v_proj",
+ "down_proj",
+ "o_proj",
+ "q_proj",
+ "k_proj",
+ "gate_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
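
The adapter_config.json above describes a rank-128 LoRA adapter (lora_alpha 256, dropout 0.05) over the attention and MLP projection matrices of ibm-granite/granite-3.3-8b-base, trained for causal language modeling. As a rough orientation only, a checkpoint like this is usually loaded by attaching the adapter to the base model with PEFT; the sketch below is not part of the commit, assumes a local clone of this repo, and the dtype, device placement, and prompt are illustrative.

```python
# Hedged sketch (not from the uploaded files): attach the LoRA adapter in
# 20_128_e5_3e-5/checkpoint-1324 to its base model and run a quick generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ibm-granite/granite-3.3-8b-base"      # from adapter_config.json
adapter_dir = "20_128_e5_3e-5/checkpoint-1324"   # one of the uploaded checkpoints

tokenizer = AutoTokenizer.from_pretrained(adapter_dir)  # tokenizer files ship with the checkpoint
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)    # loads adapter_model.safetensors (r=128)
model.eval()

inputs = tokenizer("def hello():", return_tensors="pt").to(base.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```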
20_128_e5_3e-5/checkpoint-1324/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ca05752da61a80739bf5798a0c549f952aca31fa1eff33e19e1f774b65a51852
3
+ size 791751704
20_128_e5_3e-5/checkpoint-1324/latest ADDED
@@ -0,0 +1 @@
+ global_step1324
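
The `latest` file is DeepSpeed's pointer to the most recent ZeRO checkpoint tag (here `global_step1324`), and the bundled zero_to_fp32.py listed in this folder consolidates the sharded states saved under that tag. A minimal sketch of the programmatic route, assuming a local copy of the checkpoint directory and that the `global_step1324/` shard directory is present alongside (it is not visible in this truncated 50-file view):

```python
# Hedged sketch: consolidate the ZeRO shards referenced by `latest` into one
# fp32 state dict, using the helper DeepSpeed (and the bundled zero_to_fp32.py) provides.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "20_128_e5_3e-5/checkpoint-1324"  # directory containing `latest` and global_step1324/
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir, tag="global_step1324")
print(f"{sum(t.numel() for t in state_dict.values()):,} consolidated fp32 parameters")
```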
20_128_e5_3e-5/checkpoint-1324/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1324/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38005bb459554e963853efcbb438b851e3ab41d63ca32a19f7866a90d54cb9aa
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:97e1203b4db745d44fa014f5ff222401a27b144668f24cd545403992d3e4e518
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15acd5cf97c69094f345367457655ef2f3d0a0de4e459ff50ea3cb1aeea0bcd8
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c66d82644a46c7c46311f7cf0e3700341af387781fa71e9aa6dc28382f79af8
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:984a6f764f72144e5bef9dc7775cc3415dace5f3d52ac0a51da7304be657c348
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:458bdf9734d1bad01756db0904b7e43ec430b71dd14405f1754d0ec073634c5e
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e5a800e8e8b71992dc0325ff71939e298c6040638829306239d9a1afa8565e16
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:82b5bb9cea5e6a743ab201e36e483a1a38fa829c9b21a1bbd187c3cc43caa9f8
3
+ size 15920
20_128_e5_3e-5/checkpoint-1324/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:295feaafe6ea009a0d89eb0b5bc1164394473cd9e7ba44e52f8263fabf289952
3
+ size 1064
20_128_e5_3e-5/checkpoint-1324/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
+ {
+ "additional_special_tokens": [
+ "<|endoftext|>",
+ "<fim_prefix>",
+ "<fim_middle>",
+ "<fim_suffix>",
+ "<fim_pad>",
+ "<filename>",
+ "<gh_stars>",
+ "<issue_start>",
+ "<issue_comment>",
+ "<issue_closed>",
+ "<jupyter_start>",
+ "<jupyter_text>",
+ "<jupyter_code>",
+ "<jupyter_output>",
+ "<empty_output>",
+ "<commit_before>",
+ "<commit_msg>",
+ "<commit_after>",
+ "<reponame>"
+ ],
+ "bos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<reponame>",
+ "unk_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
20_128_e5_3e-5/checkpoint-1324/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1324/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<fim_prefix>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "<fim_middle>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<fim_suffix>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<fim_pad>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "5": {
45
+ "content": "<filename>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "6": {
53
+ "content": "<gh_stars>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "7": {
61
+ "content": "<issue_start>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "8": {
69
+ "content": "<issue_comment>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "9": {
77
+ "content": "<issue_closed>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "10": {
85
+ "content": "<jupyter_start>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "11": {
93
+ "content": "<jupyter_text>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "12": {
101
+ "content": "<jupyter_code>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "13": {
109
+ "content": "<jupyter_output>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "14": {
117
+ "content": "<empty_output>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "15": {
125
+ "content": "<commit_before>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "16": {
133
+ "content": "<commit_msg>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "17": {
141
+ "content": "<commit_after>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "18": {
149
+ "content": "<reponame>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ }
156
+ },
157
+ "additional_special_tokens": [
158
+ "<|endoftext|>",
159
+ "<fim_prefix>",
160
+ "<fim_middle>",
161
+ "<fim_suffix>",
162
+ "<fim_pad>",
163
+ "<filename>",
164
+ "<gh_stars>",
165
+ "<issue_start>",
166
+ "<issue_comment>",
167
+ "<issue_closed>",
168
+ "<jupyter_start>",
169
+ "<jupyter_text>",
170
+ "<jupyter_code>",
171
+ "<jupyter_output>",
172
+ "<empty_output>",
173
+ "<commit_before>",
174
+ "<commit_msg>",
175
+ "<commit_after>",
176
+ "<reponame>"
177
+ ],
178
+ "bos_token": "<|endoftext|>",
179
+ "clean_up_tokenization_spaces": true,
180
+ "eos_token": "<|endoftext|>",
181
+ "extra_special_tokens": {},
182
+ "model_max_length": 8192,
183
+ "pad_token": "<reponame>",
184
+ "padding_side": "left",
185
+ "tokenizer_class": "GPT2Tokenizer",
186
+ "unk_token": "<|endoftext|>",
187
+ "vocab_size": 49152
188
+ }
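
tokenizer_config.json above configures a GPT2-style BPE tokenizer with an 8192-token maximum length, left padding, and `<reponame>` repurposed as the padding token. A small sketch of what that padding behaviour looks like when the checkpoint's tokenizer files are loaded locally (the path is an assumption):

```python
# Hedged sketch: load the checkpoint tokenizer and left-pad a small batch,
# as tokenizer_config.json requests (padding_side "left", pad_token "<reponame>").
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("20_128_e5_3e-5/checkpoint-1324")
batch = tok(["def add(a, b):", "print('hi')"], padding=True, return_tensors="pt")
print(tok.pad_token, tok.padding_side, batch["input_ids"].shape)  # shorter prompt padded on the left
```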
20_128_e5_3e-5/checkpoint-1324/trainer_state.json ADDED
@@ -0,0 +1,1882 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 4.0,
6
+ "eval_steps": 500,
7
+ "global_step": 1324,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.015128593040847202,
14
+ "grad_norm": 1.1886738538742065,
15
+ "learning_rate": 1.4457831325301207e-06,
16
+ "loss": 1.2563,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.030257186081694403,
21
+ "grad_norm": 0.9057577848434448,
22
+ "learning_rate": 3.2530120481927713e-06,
23
+ "loss": 1.3015,
24
+ "step": 10
25
+ },
26
+ {
27
+ "epoch": 0.0453857791225416,
28
+ "grad_norm": 0.6479454040527344,
29
+ "learning_rate": 5.060240963855422e-06,
30
+ "loss": 1.2638,
31
+ "step": 15
32
+ },
33
+ {
34
+ "epoch": 0.060514372163388806,
35
+ "grad_norm": 0.5733408331871033,
36
+ "learning_rate": 6.867469879518072e-06,
37
+ "loss": 1.2674,
38
+ "step": 20
39
+ },
40
+ {
41
+ "epoch": 0.07564296520423601,
42
+ "grad_norm": 0.4907199740409851,
43
+ "learning_rate": 8.674698795180722e-06,
44
+ "loss": 1.2763,
45
+ "step": 25
46
+ },
47
+ {
48
+ "epoch": 0.0907715582450832,
49
+ "grad_norm": 0.5530977845191956,
50
+ "learning_rate": 1.0481927710843374e-05,
51
+ "loss": 1.2184,
52
+ "step": 30
53
+ },
54
+ {
55
+ "epoch": 0.1059001512859304,
56
+ "grad_norm": 0.568281352519989,
57
+ "learning_rate": 1.2289156626506024e-05,
58
+ "loss": 1.2443,
59
+ "step": 35
60
+ },
61
+ {
62
+ "epoch": 0.12102874432677761,
63
+ "grad_norm": 0.7138775587081909,
64
+ "learning_rate": 1.4096385542168676e-05,
65
+ "loss": 1.2258,
66
+ "step": 40
67
+ },
68
+ {
69
+ "epoch": 0.1361573373676248,
70
+ "grad_norm": 0.4135199189186096,
71
+ "learning_rate": 1.5903614457831326e-05,
72
+ "loss": 1.1615,
73
+ "step": 45
74
+ },
75
+ {
76
+ "epoch": 0.15128593040847202,
77
+ "grad_norm": 0.4403356909751892,
78
+ "learning_rate": 1.7710843373493978e-05,
79
+ "loss": 1.2148,
80
+ "step": 50
81
+ },
82
+ {
83
+ "epoch": 0.1664145234493192,
84
+ "grad_norm": 0.4433908462524414,
85
+ "learning_rate": 1.9518072289156627e-05,
86
+ "loss": 1.1543,
87
+ "step": 55
88
+ },
89
+ {
90
+ "epoch": 0.1815431164901664,
91
+ "grad_norm": 0.4714948832988739,
92
+ "learning_rate": 2.1325301204819275e-05,
93
+ "loss": 1.176,
94
+ "step": 60
95
+ },
96
+ {
97
+ "epoch": 0.19667170953101362,
98
+ "grad_norm": 0.47108614444732666,
99
+ "learning_rate": 2.313253012048193e-05,
100
+ "loss": 1.2043,
101
+ "step": 65
102
+ },
103
+ {
104
+ "epoch": 0.2118003025718608,
105
+ "grad_norm": 0.40567493438720703,
106
+ "learning_rate": 2.493975903614458e-05,
107
+ "loss": 1.1462,
108
+ "step": 70
109
+ },
110
+ {
111
+ "epoch": 0.22692889561270801,
112
+ "grad_norm": 0.4979853928089142,
113
+ "learning_rate": 2.674698795180723e-05,
114
+ "loss": 1.1362,
115
+ "step": 75
116
+ },
117
+ {
118
+ "epoch": 0.24205748865355523,
119
+ "grad_norm": 0.4744911193847656,
120
+ "learning_rate": 2.855421686746988e-05,
121
+ "loss": 1.1297,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.25718608169440244,
126
+ "grad_norm": 0.48895782232284546,
127
+ "learning_rate": 2.9999970045934107e-05,
128
+ "loss": 1.0646,
129
+ "step": 85
130
+ },
131
+ {
132
+ "epoch": 0.2723146747352496,
133
+ "grad_norm": 0.5125440955162048,
134
+ "learning_rate": 2.999892166618921e-05,
135
+ "loss": 1.0775,
136
+ "step": 90
137
+ },
138
+ {
139
+ "epoch": 0.2874432677760968,
140
+ "grad_norm": 0.5728165507316589,
141
+ "learning_rate": 2.999637570278028e-05,
142
+ "loss": 1.1223,
143
+ "step": 95
144
+ },
145
+ {
146
+ "epoch": 0.30257186081694404,
147
+ "grad_norm": 0.5014736652374268,
148
+ "learning_rate": 2.999233240991181e-05,
149
+ "loss": 1.0431,
150
+ "step": 100
151
+ },
152
+ {
153
+ "epoch": 0.3177004538577912,
154
+ "grad_norm": 0.4918844699859619,
155
+ "learning_rate": 2.9986792191290767e-05,
156
+ "loss": 1.0938,
157
+ "step": 105
158
+ },
159
+ {
160
+ "epoch": 0.3328290468986384,
161
+ "grad_norm": 0.5097721219062805,
162
+ "learning_rate": 2.9979755600086323e-05,
163
+ "loss": 1.0524,
164
+ "step": 110
165
+ },
166
+ {
167
+ "epoch": 0.34795763993948564,
168
+ "grad_norm": 0.6224564909934998,
169
+ "learning_rate": 2.9971223338874577e-05,
170
+ "loss": 0.975,
171
+ "step": 115
172
+ },
173
+ {
174
+ "epoch": 0.3630862329803328,
175
+ "grad_norm": 0.6176092028617859,
176
+ "learning_rate": 2.996119625956845e-05,
177
+ "loss": 1.0153,
178
+ "step": 120
179
+ },
180
+ {
181
+ "epoch": 0.37821482602118,
182
+ "grad_norm": 0.5799441933631897,
183
+ "learning_rate": 2.994967536333258e-05,
184
+ "loss": 0.9895,
185
+ "step": 125
186
+ },
187
+ {
188
+ "epoch": 0.39334341906202724,
189
+ "grad_norm": 0.5250234007835388,
190
+ "learning_rate": 2.993666180048341e-05,
191
+ "loss": 1.0104,
192
+ "step": 130
193
+ },
194
+ {
195
+ "epoch": 0.4084720121028744,
196
+ "grad_norm": 0.6541566848754883,
197
+ "learning_rate": 2.992215687037428e-05,
198
+ "loss": 1.0159,
199
+ "step": 135
200
+ },
201
+ {
202
+ "epoch": 0.4236006051437216,
203
+ "grad_norm": 0.6040735244750977,
204
+ "learning_rate": 2.9906162021265736e-05,
205
+ "loss": 1.0339,
206
+ "step": 140
207
+ },
208
+ {
209
+ "epoch": 0.43872919818456885,
210
+ "grad_norm": 0.846281111240387,
211
+ "learning_rate": 2.9888678850180895e-05,
212
+ "loss": 0.9737,
213
+ "step": 145
214
+ },
215
+ {
216
+ "epoch": 0.45385779122541603,
217
+ "grad_norm": 0.6770375967025757,
218
+ "learning_rate": 2.986970910274601e-05,
219
+ "loss": 0.9342,
220
+ "step": 150
221
+ },
222
+ {
223
+ "epoch": 0.4689863842662632,
224
+ "grad_norm": 0.7691357135772705,
225
+ "learning_rate": 2.9849254673016178e-05,
226
+ "loss": 0.965,
227
+ "step": 155
228
+ },
229
+ {
230
+ "epoch": 0.48411497730711045,
231
+ "grad_norm": 0.608444094657898,
232
+ "learning_rate": 2.982731760328619e-05,
233
+ "loss": 0.9331,
234
+ "step": 160
235
+ },
236
+ {
237
+ "epoch": 0.49924357034795763,
238
+ "grad_norm": 0.7073132991790771,
239
+ "learning_rate": 2.980390008388667e-05,
240
+ "loss": 0.9161,
241
+ "step": 165
242
+ },
243
+ {
244
+ "epoch": 0.5143721633888049,
245
+ "grad_norm": 0.6664533615112305,
246
+ "learning_rate": 2.977900445296533e-05,
247
+ "loss": 0.9635,
248
+ "step": 170
249
+ },
250
+ {
251
+ "epoch": 0.529500756429652,
252
+ "grad_norm": 0.7164839506149292,
253
+ "learning_rate": 2.975263319625355e-05,
254
+ "loss": 0.9521,
255
+ "step": 175
256
+ },
257
+ {
258
+ "epoch": 0.5446293494704992,
259
+ "grad_norm": 0.7335389256477356,
260
+ "learning_rate": 2.972478894681817e-05,
261
+ "loss": 0.9104,
262
+ "step": 180
263
+ },
264
+ {
265
+ "epoch": 0.5597579425113465,
266
+ "grad_norm": 0.7356768250465393,
267
+ "learning_rate": 2.9695474484798582e-05,
268
+ "loss": 0.8796,
269
+ "step": 185
270
+ },
271
+ {
272
+ "epoch": 0.5748865355521936,
273
+ "grad_norm": 0.7879706621170044,
274
+ "learning_rate": 2.966469273712917e-05,
275
+ "loss": 0.9024,
276
+ "step": 190
277
+ },
278
+ {
279
+ "epoch": 0.5900151285930408,
280
+ "grad_norm": 0.8149765133857727,
281
+ "learning_rate": 2.9632446777247045e-05,
282
+ "loss": 0.8684,
283
+ "step": 195
284
+ },
285
+ {
286
+ "epoch": 0.6051437216338881,
287
+ "grad_norm": 0.718940794467926,
288
+ "learning_rate": 2.959873982478517e-05,
289
+ "loss": 0.9132,
290
+ "step": 200
291
+ },
292
+ {
293
+ "epoch": 0.6202723146747352,
294
+ "grad_norm": 0.7738745212554932,
295
+ "learning_rate": 2.956357524525093e-05,
296
+ "loss": 0.8179,
297
+ "step": 205
298
+ },
299
+ {
300
+ "epoch": 0.6354009077155824,
301
+ "grad_norm": 0.7994048595428467,
302
+ "learning_rate": 2.9526956549690037e-05,
303
+ "loss": 0.8375,
304
+ "step": 210
305
+ },
306
+ {
307
+ "epoch": 0.6505295007564297,
308
+ "grad_norm": 0.7944709062576294,
309
+ "learning_rate": 2.9488887394336025e-05,
310
+ "loss": 0.8244,
311
+ "step": 215
312
+ },
313
+ {
314
+ "epoch": 0.6656580937972768,
315
+ "grad_norm": 0.8380885124206543,
316
+ "learning_rate": 2.9449371580245162e-05,
317
+ "loss": 0.7798,
318
+ "step": 220
319
+ },
320
+ {
321
+ "epoch": 0.680786686838124,
322
+ "grad_norm": 0.8535066246986389,
323
+ "learning_rate": 2.9408413052916923e-05,
324
+ "loss": 0.8358,
325
+ "step": 225
326
+ },
327
+ {
328
+ "epoch": 0.6959152798789713,
329
+ "grad_norm": 0.8346946835517883,
330
+ "learning_rate": 2.9366015901900066e-05,
331
+ "loss": 0.8022,
332
+ "step": 230
333
+ },
334
+ {
335
+ "epoch": 0.7110438729198184,
336
+ "grad_norm": 0.840814471244812,
337
+ "learning_rate": 2.9322184360384297e-05,
338
+ "loss": 0.8327,
339
+ "step": 235
340
+ },
341
+ {
342
+ "epoch": 0.7261724659606656,
343
+ "grad_norm": 0.9543588757514954,
344
+ "learning_rate": 2.92769228047776e-05,
345
+ "loss": 0.7916,
346
+ "step": 240
347
+ },
348
+ {
349
+ "epoch": 0.7413010590015129,
350
+ "grad_norm": 0.9211695194244385,
351
+ "learning_rate": 2.923023575426927e-05,
352
+ "loss": 0.8282,
353
+ "step": 245
354
+ },
355
+ {
356
+ "epoch": 0.75642965204236,
357
+ "grad_norm": 0.9559474587440491,
358
+ "learning_rate": 2.91821278703787e-05,
359
+ "loss": 0.7517,
360
+ "step": 250
361
+ },
362
+ {
363
+ "epoch": 0.7715582450832073,
364
+ "grad_norm": 0.9239177107810974,
365
+ "learning_rate": 2.9132603956489934e-05,
366
+ "loss": 0.7995,
367
+ "step": 255
368
+ },
369
+ {
370
+ "epoch": 0.7866868381240545,
371
+ "grad_norm": 0.8963080048561096,
372
+ "learning_rate": 2.9081668957372072e-05,
373
+ "loss": 0.7688,
374
+ "step": 260
375
+ },
376
+ {
377
+ "epoch": 0.8018154311649016,
378
+ "grad_norm": 0.9856358766555786,
379
+ "learning_rate": 2.902932795868556e-05,
380
+ "loss": 0.8155,
381
+ "step": 265
382
+ },
383
+ {
384
+ "epoch": 0.8169440242057489,
385
+ "grad_norm": 0.8115594387054443,
386
+ "learning_rate": 2.8975586186474397e-05,
387
+ "loss": 0.7465,
388
+ "step": 270
389
+ },
390
+ {
391
+ "epoch": 0.8320726172465961,
392
+ "grad_norm": 0.8640703558921814,
393
+ "learning_rate": 2.8920449006644346e-05,
394
+ "loss": 0.7582,
395
+ "step": 275
396
+ },
397
+ {
398
+ "epoch": 0.8472012102874432,
399
+ "grad_norm": 0.8925158977508545,
400
+ "learning_rate": 2.886392192442715e-05,
401
+ "loss": 0.7413,
402
+ "step": 280
403
+ },
404
+ {
405
+ "epoch": 0.8623298033282905,
406
+ "grad_norm": 0.9191139340400696,
407
+ "learning_rate": 2.8806010583830885e-05,
408
+ "loss": 0.7536,
409
+ "step": 285
410
+ },
411
+ {
412
+ "epoch": 0.8774583963691377,
413
+ "grad_norm": 0.9398569464683533,
414
+ "learning_rate": 2.8746720767076403e-05,
415
+ "loss": 0.6988,
416
+ "step": 290
417
+ },
418
+ {
419
+ "epoch": 0.8925869894099848,
420
+ "grad_norm": 1.0098797082901,
421
+ "learning_rate": 2.8686058394020006e-05,
422
+ "loss": 0.7048,
423
+ "step": 295
424
+ },
425
+ {
426
+ "epoch": 0.9077155824508321,
427
+ "grad_norm": 0.9016486406326294,
428
+ "learning_rate": 2.862402952156238e-05,
429
+ "loss": 0.6867,
430
+ "step": 300
431
+ },
432
+ {
433
+ "epoch": 0.9228441754916793,
434
+ "grad_norm": 0.9746642708778381,
435
+ "learning_rate": 2.856064034304384e-05,
436
+ "loss": 0.6887,
437
+ "step": 305
438
+ },
439
+ {
440
+ "epoch": 0.9379727685325264,
441
+ "grad_norm": 0.9909067749977112,
442
+ "learning_rate": 2.849589718762592e-05,
443
+ "loss": 0.7016,
444
+ "step": 310
445
+ },
446
+ {
447
+ "epoch": 0.9531013615733737,
448
+ "grad_norm": 1.0518617630004883,
449
+ "learning_rate": 2.8429806519659463e-05,
450
+ "loss": 0.7014,
451
+ "step": 315
452
+ },
453
+ {
454
+ "epoch": 0.9682299546142209,
455
+ "grad_norm": 1.1244536638259888,
456
+ "learning_rate": 2.8362374938039174e-05,
457
+ "loss": 0.6708,
458
+ "step": 320
459
+ },
460
+ {
461
+ "epoch": 0.983358547655068,
462
+ "grad_norm": 1.0927412509918213,
463
+ "learning_rate": 2.8293609175544737e-05,
464
+ "loss": 0.6681,
465
+ "step": 325
466
+ },
467
+ {
468
+ "epoch": 0.9984871406959153,
469
+ "grad_norm": 0.940122127532959,
470
+ "learning_rate": 2.8223516098168573e-05,
471
+ "loss": 0.6809,
472
+ "step": 330
473
+ },
474
+ {
475
+ "epoch": 1.0121028744326777,
476
+ "grad_norm": 1.1608147621154785,
477
+ "learning_rate": 2.8152102704430312e-05,
478
+ "loss": 0.6216,
479
+ "step": 335
480
+ },
481
+ {
482
+ "epoch": 1.027231467473525,
483
+ "grad_norm": 1.0822938680648804,
484
+ "learning_rate": 2.8079376124678e-05,
485
+ "loss": 0.602,
486
+ "step": 340
487
+ },
488
+ {
489
+ "epoch": 1.0423600605143721,
490
+ "grad_norm": 1.149957537651062,
491
+ "learning_rate": 2.800534362037618e-05,
492
+ "loss": 0.5483,
493
+ "step": 345
494
+ },
495
+ {
496
+ "epoch": 1.0574886535552193,
497
+ "grad_norm": 1.0549287796020508,
498
+ "learning_rate": 2.793001258338084e-05,
499
+ "loss": 0.5477,
500
+ "step": 350
501
+ },
502
+ {
503
+ "epoch": 1.0726172465960666,
504
+ "grad_norm": 1.0644642114639282,
505
+ "learning_rate": 2.7853390535201396e-05,
506
+ "loss": 0.5725,
507
+ "step": 355
508
+ },
509
+ {
510
+ "epoch": 1.0877458396369137,
511
+ "grad_norm": 1.1603634357452393,
512
+ "learning_rate": 2.7775485126249665e-05,
513
+ "loss": 0.5744,
514
+ "step": 360
515
+ },
516
+ {
517
+ "epoch": 1.102874432677761,
518
+ "grad_norm": 1.0989371538162231,
519
+ "learning_rate": 2.7696304135076024e-05,
520
+ "loss": 0.5413,
521
+ "step": 365
522
+ },
523
+ {
524
+ "epoch": 1.1180030257186082,
525
+ "grad_norm": 1.0977047681808472,
526
+ "learning_rate": 2.7615855467592756e-05,
527
+ "loss": 0.6023,
528
+ "step": 370
529
+ },
530
+ {
531
+ "epoch": 1.1331316187594553,
532
+ "grad_norm": 1.0698915719985962,
533
+ "learning_rate": 2.753414715628464e-05,
534
+ "loss": 0.5721,
535
+ "step": 375
536
+ },
537
+ {
538
+ "epoch": 1.1482602118003027,
539
+ "grad_norm": 1.1103595495224,
540
+ "learning_rate": 2.745118735940699e-05,
541
+ "loss": 0.5445,
542
+ "step": 380
543
+ },
544
+ {
545
+ "epoch": 1.1633888048411498,
546
+ "grad_norm": 1.0626442432403564,
547
+ "learning_rate": 2.7366984360171047e-05,
548
+ "loss": 0.5558,
549
+ "step": 385
550
+ },
551
+ {
552
+ "epoch": 1.178517397881997,
553
+ "grad_norm": 1.076282024383545,
554
+ "learning_rate": 2.7281546565916948e-05,
555
+ "loss": 0.487,
556
+ "step": 390
557
+ },
558
+ {
559
+ "epoch": 1.1936459909228443,
560
+ "grad_norm": 1.100417971611023,
561
+ "learning_rate": 2.719488250727427e-05,
562
+ "loss": 0.5346,
563
+ "step": 395
564
+ },
565
+ {
566
+ "epoch": 1.2087745839636914,
567
+ "grad_norm": 1.079980731010437,
568
+ "learning_rate": 2.710700083731032e-05,
569
+ "loss": 0.507,
570
+ "step": 400
571
+ },
572
+ {
573
+ "epoch": 1.2239031770045385,
574
+ "grad_norm": 1.086501955986023,
575
+ "learning_rate": 2.701791033066612e-05,
576
+ "loss": 0.5573,
577
+ "step": 405
578
+ },
579
+ {
580
+ "epoch": 1.239031770045386,
581
+ "grad_norm": 1.3494257926940918,
582
+ "learning_rate": 2.6927619882680286e-05,
583
+ "loss": 0.5116,
584
+ "step": 410
585
+ },
586
+ {
587
+ "epoch": 1.254160363086233,
588
+ "grad_norm": 1.0927280187606812,
589
+ "learning_rate": 2.6836138508500918e-05,
590
+ "loss": 0.5368,
591
+ "step": 415
592
+ },
593
+ {
594
+ "epoch": 1.2692889561270801,
595
+ "grad_norm": 1.0733301639556885,
596
+ "learning_rate": 2.6743475342185414e-05,
597
+ "loss": 0.5352,
598
+ "step": 420
599
+ },
600
+ {
601
+ "epoch": 1.2844175491679275,
602
+ "grad_norm": 1.0263490676879883,
603
+ "learning_rate": 2.664963963578851e-05,
604
+ "loss": 0.5274,
605
+ "step": 425
606
+ },
607
+ {
608
+ "epoch": 1.2995461422087746,
609
+ "grad_norm": 1.1017117500305176,
610
+ "learning_rate": 2.655464075843847e-05,
611
+ "loss": 0.5247,
612
+ "step": 430
613
+ },
614
+ {
615
+ "epoch": 1.3146747352496218,
616
+ "grad_norm": 1.308171033859253,
617
+ "learning_rate": 2.6458488195401636e-05,
618
+ "loss": 0.534,
619
+ "step": 435
620
+ },
621
+ {
622
+ "epoch": 1.329803328290469,
623
+ "grad_norm": 1.0724010467529297,
624
+ "learning_rate": 2.6361191547135355e-05,
625
+ "loss": 0.4958,
626
+ "step": 440
627
+ },
628
+ {
629
+ "epoch": 1.3449319213313162,
630
+ "grad_norm": 1.0588688850402832,
631
+ "learning_rate": 2.62627605283294e-05,
632
+ "loss": 0.4838,
633
+ "step": 445
634
+ },
635
+ {
636
+ "epoch": 1.3600605143721634,
637
+ "grad_norm": 1.0089439153671265,
638
+ "learning_rate": 2.6163204966936022e-05,
639
+ "loss": 0.5041,
640
+ "step": 450
641
+ },
642
+ {
643
+ "epoch": 1.3751891074130107,
644
+ "grad_norm": 1.1149275302886963,
645
+ "learning_rate": 2.6062534803188628e-05,
646
+ "loss": 0.4661,
647
+ "step": 455
648
+ },
649
+ {
650
+ "epoch": 1.3903177004538578,
651
+ "grad_norm": 1.2650748491287231,
652
+ "learning_rate": 2.596076008860933e-05,
653
+ "loss": 0.4994,
654
+ "step": 460
655
+ },
656
+ {
657
+ "epoch": 1.405446293494705,
658
+ "grad_norm": 1.2204926013946533,
659
+ "learning_rate": 2.5857890985005315e-05,
660
+ "loss": 0.4399,
661
+ "step": 465
662
+ },
663
+ {
664
+ "epoch": 1.4205748865355523,
665
+ "grad_norm": 1.078391194343567,
666
+ "learning_rate": 2.5753937763454233e-05,
667
+ "loss": 0.5202,
668
+ "step": 470
669
+ },
670
+ {
671
+ "epoch": 1.4357034795763994,
672
+ "grad_norm": 1.0975085496902466,
673
+ "learning_rate": 2.5648910803278662e-05,
674
+ "loss": 0.5036,
675
+ "step": 475
676
+ },
677
+ {
678
+ "epoch": 1.4508320726172466,
679
+ "grad_norm": 1.299043893814087,
680
+ "learning_rate": 2.55428205910098e-05,
681
+ "loss": 0.495,
682
+ "step": 480
683
+ },
684
+ {
685
+ "epoch": 1.465960665658094,
686
+ "grad_norm": 1.2129578590393066,
687
+ "learning_rate": 2.543567771934039e-05,
688
+ "loss": 0.5012,
689
+ "step": 485
690
+ },
691
+ {
692
+ "epoch": 1.481089258698941,
693
+ "grad_norm": 1.2743144035339355,
694
+ "learning_rate": 2.5327492886067115e-05,
695
+ "loss": 0.5098,
696
+ "step": 490
697
+ },
698
+ {
699
+ "epoch": 1.4962178517397882,
700
+ "grad_norm": 1.2529656887054443,
701
+ "learning_rate": 2.5218276893022435e-05,
702
+ "loss": 0.4847,
703
+ "step": 495
704
+ },
705
+ {
706
+ "epoch": 1.5113464447806355,
707
+ "grad_norm": 1.0965994596481323,
708
+ "learning_rate": 2.5108040644996087e-05,
709
+ "loss": 0.4372,
710
+ "step": 500
711
+ },
712
+ {
713
+ "epoch": 1.5264750378214826,
714
+ "grad_norm": 1.1192255020141602,
715
+ "learning_rate": 2.4996795148646283e-05,
716
+ "loss": 0.4632,
717
+ "step": 505
718
+ },
719
+ {
720
+ "epoch": 1.5416036308623298,
721
+ "grad_norm": 1.0835782289505005,
722
+ "learning_rate": 2.4884551511400714e-05,
723
+ "loss": 0.4551,
724
+ "step": 510
725
+ },
726
+ {
727
+ "epoch": 1.5567322239031771,
728
+ "grad_norm": 1.0586811304092407,
729
+ "learning_rate": 2.4771320940347554e-05,
730
+ "loss": 0.439,
731
+ "step": 515
732
+ },
733
+ {
734
+ "epoch": 1.5718608169440242,
735
+ "grad_norm": 1.0878212451934814,
736
+ "learning_rate": 2.4657114741116458e-05,
737
+ "loss": 0.4175,
738
+ "step": 520
739
+ },
740
+ {
741
+ "epoch": 1.5869894099848714,
742
+ "grad_norm": 1.1067463159561157,
743
+ "learning_rate": 2.454194431674972e-05,
744
+ "loss": 0.4371,
745
+ "step": 525
746
+ },
747
+ {
748
+ "epoch": 1.6021180030257187,
749
+ "grad_norm": 1.3756803274154663,
750
+ "learning_rate": 2.4425821166563757e-05,
751
+ "loss": 0.4379,
752
+ "step": 530
753
+ },
754
+ {
755
+ "epoch": 1.6172465960665658,
756
+ "grad_norm": 1.1154321432113647,
757
+ "learning_rate": 2.4308756885000928e-05,
758
+ "loss": 0.4382,
759
+ "step": 535
760
+ },
761
+ {
762
+ "epoch": 1.632375189107413,
763
+ "grad_norm": 1.2593263387680054,
764
+ "learning_rate": 2.419076316047189e-05,
765
+ "loss": 0.4171,
766
+ "step": 540
767
+ },
768
+ {
769
+ "epoch": 1.6475037821482603,
770
+ "grad_norm": 1.035703420639038,
771
+ "learning_rate": 2.407185177418853e-05,
772
+ "loss": 0.4212,
773
+ "step": 545
774
+ },
775
+ {
776
+ "epoch": 1.6626323751891074,
777
+ "grad_norm": 1.7641640901565552,
778
+ "learning_rate": 2.3952034598987677e-05,
779
+ "loss": 0.4193,
780
+ "step": 550
781
+ },
782
+ {
783
+ "epoch": 1.6777609682299546,
784
+ "grad_norm": 1.1478875875473022,
785
+ "learning_rate": 2.3831323598145644e-05,
786
+ "loss": 0.4529,
787
+ "step": 555
788
+ },
789
+ {
790
+ "epoch": 1.692889561270802,
791
+ "grad_norm": 1.028493046760559,
792
+ "learning_rate": 2.370973082418374e-05,
793
+ "loss": 0.4165,
794
+ "step": 560
795
+ },
796
+ {
797
+ "epoch": 1.708018154311649,
798
+ "grad_norm": 1.1380536556243896,
799
+ "learning_rate": 2.3587268417664848e-05,
800
+ "loss": 0.4264,
801
+ "step": 565
802
+ },
803
+ {
804
+ "epoch": 1.7231467473524962,
805
+ "grad_norm": 1.0838291645050049,
806
+ "learning_rate": 2.34639486059813e-05,
807
+ "loss": 0.4264,
808
+ "step": 570
809
+ },
810
+ {
811
+ "epoch": 1.7382753403933435,
812
+ "grad_norm": 1.1299512386322021,
813
+ "learning_rate": 2.3339783702133955e-05,
814
+ "loss": 0.4051,
815
+ "step": 575
816
+ },
817
+ {
818
+ "epoch": 1.7534039334341907,
819
+ "grad_norm": 1.0154179334640503,
820
+ "learning_rate": 2.321478610350282e-05,
821
+ "loss": 0.3535,
822
+ "step": 580
823
+ },
824
+ {
825
+ "epoch": 1.7685325264750378,
826
+ "grad_norm": 1.2361255884170532,
827
+ "learning_rate": 2.3088968290609223e-05,
828
+ "loss": 0.3816,
829
+ "step": 585
830
+ },
831
+ {
832
+ "epoch": 1.7836611195158851,
833
+ "grad_norm": 1.1661337614059448,
834
+ "learning_rate": 2.2962342825869684e-05,
835
+ "loss": 0.3797,
836
+ "step": 590
837
+ },
838
+ {
839
+ "epoch": 1.7987897125567323,
840
+ "grad_norm": 1.0582844018936157,
841
+ "learning_rate": 2.2834922352341587e-05,
842
+ "loss": 0.3807,
843
+ "step": 595
844
+ },
845
+ {
846
+ "epoch": 1.8139183055975794,
847
+ "grad_norm": 1.287212610244751,
848
+ "learning_rate": 2.2706719592460843e-05,
849
+ "loss": 0.4034,
850
+ "step": 600
851
+ },
852
+ {
853
+ "epoch": 1.8290468986384267,
854
+ "grad_norm": 1.263323187828064,
855
+ "learning_rate": 2.25777473467716e-05,
856
+ "loss": 0.4052,
857
+ "step": 605
858
+ },
859
+ {
860
+ "epoch": 1.8441754916792739,
861
+ "grad_norm": 0.9774038195610046,
862
+ "learning_rate": 2.2448018492648147e-05,
863
+ "loss": 0.3838,
864
+ "step": 610
865
+ },
866
+ {
867
+ "epoch": 1.859304084720121,
868
+ "grad_norm": 1.2147777080535889,
869
+ "learning_rate": 2.2317545983009166e-05,
870
+ "loss": 0.4093,
871
+ "step": 615
872
+ },
873
+ {
874
+ "epoch": 1.8744326777609683,
875
+ "grad_norm": 1.171796202659607,
876
+ "learning_rate": 2.218634284502444e-05,
877
+ "loss": 0.3878,
878
+ "step": 620
879
+ },
880
+ {
881
+ "epoch": 1.8895612708018155,
882
+ "grad_norm": 1.2053415775299072,
883
+ "learning_rate": 2.205442217881412e-05,
884
+ "loss": 0.3768,
885
+ "step": 625
886
+ },
887
+ {
888
+ "epoch": 1.9046898638426626,
889
+ "grad_norm": 1.192639946937561,
890
+ "learning_rate": 2.192179715614077e-05,
891
+ "loss": 0.398,
892
+ "step": 630
893
+ },
894
+ {
895
+ "epoch": 1.91981845688351,
896
+ "grad_norm": 1.2259142398834229,
897
+ "learning_rate": 2.1788481019094164e-05,
898
+ "loss": 0.3821,
899
+ "step": 635
900
+ },
901
+ {
902
+ "epoch": 1.934947049924357,
903
+ "grad_norm": 1.2637989521026611,
904
+ "learning_rate": 2.165448707876916e-05,
905
+ "loss": 0.3581,
906
+ "step": 640
907
+ },
908
+ {
909
+ "epoch": 1.9500756429652042,
910
+ "grad_norm": 1.0719422101974487,
911
+ "learning_rate": 2.1519828713936614e-05,
912
+ "loss": 0.3681,
913
+ "step": 645
914
+ },
915
+ {
916
+ "epoch": 1.9652042360060515,
917
+ "grad_norm": 1.1670970916748047,
918
+ "learning_rate": 2.138451936970757e-05,
919
+ "loss": 0.3546,
920
+ "step": 650
921
+ },
922
+ {
923
+ "epoch": 1.9803328290468987,
924
+ "grad_norm": 1.2138304710388184,
925
+ "learning_rate": 2.1248572556190837e-05,
926
+ "loss": 0.3797,
927
+ "step": 655
928
+ },
929
+ {
930
+ "epoch": 1.9954614220877458,
931
+ "grad_norm": 1.2247320413589478,
932
+ "learning_rate": 2.1112001847144013e-05,
933
+ "loss": 0.299,
934
+ "step": 660
935
+ },
936
+ {
937
+ "epoch": 2.0090771558245084,
938
+ "grad_norm": 1.101378083229065,
939
+ "learning_rate": 2.097482087861824e-05,
940
+ "loss": 0.3227,
941
+ "step": 665
942
+ },
943
+ {
944
+ "epoch": 2.0242057488653553,
945
+ "grad_norm": 1.1532782316207886,
946
+ "learning_rate": 2.0837043347596675e-05,
947
+ "loss": 0.255,
948
+ "step": 670
949
+ },
950
+ {
951
+ "epoch": 2.0393343419062027,
952
+ "grad_norm": 1.2093374729156494,
953
+ "learning_rate": 2.069868301062691e-05,
954
+ "loss": 0.2942,
955
+ "step": 675
956
+ },
957
+ {
958
+ "epoch": 2.05446293494705,
959
+ "grad_norm": 1.0964783430099487,
960
+ "learning_rate": 2.0559753682447436e-05,
961
+ "loss": 0.2824,
962
+ "step": 680
963
+ },
964
+ {
965
+ "epoch": 2.069591527987897,
966
+ "grad_norm": 1.0570346117019653,
967
+ "learning_rate": 2.0420269234608282e-05,
968
+ "loss": 0.2542,
969
+ "step": 685
970
+ },
971
+ {
972
+ "epoch": 2.0847201210287443,
973
+ "grad_norm": 1.1446951627731323,
974
+ "learning_rate": 2.0280243594086013e-05,
975
+ "loss": 0.2747,
976
+ "step": 690
977
+ },
978
+ {
979
+ "epoch": 2.0998487140695916,
980
+ "grad_norm": 1.475313425064087,
981
+ "learning_rate": 2.0139690741893152e-05,
982
+ "loss": 0.2992,
983
+ "step": 695
984
+ },
985
+ {
986
+ "epoch": 2.1149773071104385,
987
+ "grad_norm": 1.0060259103775024,
988
+ "learning_rate": 1.999862471168226e-05,
989
+ "loss": 0.2619,
990
+ "step": 700
991
+ },
992
+ {
993
+ "epoch": 2.130105900151286,
994
+ "grad_norm": 1.215386986732483,
995
+ "learning_rate": 1.985705958834471e-05,
996
+ "loss": 0.2676,
997
+ "step": 705
998
+ },
999
+ {
1000
+ "epoch": 2.145234493192133,
1001
+ "grad_norm": 1.0537375211715698,
1002
+ "learning_rate": 1.9715009506604383e-05,
1003
+ "loss": 0.2947,
1004
+ "step": 710
1005
+ },
1006
+ {
1007
+ "epoch": 2.16036308623298,
1008
+ "grad_norm": 1.187626838684082,
1009
+ "learning_rate": 1.9572488649606335e-05,
1010
+ "loss": 0.2658,
1011
+ "step": 715
1012
+ },
1013
+ {
1014
+ "epoch": 2.1754916792738275,
1015
+ "grad_norm": 1.119370698928833,
1016
+ "learning_rate": 1.942951124750071e-05,
1017
+ "loss": 0.2679,
1018
+ "step": 720
1019
+ },
1020
+ {
1021
+ "epoch": 2.190620272314675,
1022
+ "grad_norm": 1.030710220336914,
1023
+ "learning_rate": 1.928609157602189e-05,
1024
+ "loss": 0.2525,
1025
+ "step": 725
1026
+ },
1027
+ {
1028
+ "epoch": 2.205748865355522,
1029
+ "grad_norm": 1.0458844900131226,
1030
+ "learning_rate": 1.9142243955063125e-05,
1031
+ "loss": 0.2608,
1032
+ "step": 730
1033
+ },
1034
+ {
1035
+ "epoch": 2.220877458396369,
1036
+ "grad_norm": 1.1421456336975098,
1037
+ "learning_rate": 1.8997982747246744e-05,
1038
+ "loss": 0.2771,
1039
+ "step": 735
1040
+ },
1041
+ {
1042
+ "epoch": 2.2360060514372164,
1043
+ "grad_norm": 1.0802360773086548,
1044
+ "learning_rate": 1.8853322356490104e-05,
1045
+ "loss": 0.2593,
1046
+ "step": 740
1047
+ },
1048
+ {
1049
+ "epoch": 2.2511346444780633,
1050
+ "grad_norm": 1.1976512670516968,
1051
+ "learning_rate": 1.870827722656743e-05,
1052
+ "loss": 0.2663,
1053
+ "step": 745
1054
+ },
1055
+ {
1056
+ "epoch": 2.2662632375189107,
1057
+ "grad_norm": 1.2698267698287964,
1058
+ "learning_rate": 1.8562861839667635e-05,
1059
+ "loss": 0.2543,
1060
+ "step": 750
1061
+ },
1062
+ {
1063
+ "epoch": 2.281391830559758,
1064
+ "grad_norm": 1.14236581325531,
1065
+ "learning_rate": 1.8417090714948337e-05,
1066
+ "loss": 0.2514,
1067
+ "step": 755
1068
+ },
1069
+ {
1070
+ "epoch": 2.2965204236006054,
1071
+ "grad_norm": 1.158115029335022,
1072
+ "learning_rate": 1.827097840708621e-05,
1073
+ "loss": 0.2572,
1074
+ "step": 760
1075
+ },
1076
+ {
1077
+ "epoch": 2.3116490166414523,
1078
+ "grad_norm": 1.258471965789795,
1079
+ "learning_rate": 1.8124539504823704e-05,
1080
+ "loss": 0.2505,
1081
+ "step": 765
1082
+ },
1083
+ {
1084
+ "epoch": 2.3267776096822996,
1085
+ "grad_norm": 1.1164860725402832,
1086
+ "learning_rate": 1.7977788629512457e-05,
1087
+ "loss": 0.2448,
1088
+ "step": 770
1089
+ },
1090
+ {
1091
+ "epoch": 2.3419062027231465,
1092
+ "grad_norm": 1.2983452081680298,
1093
+ "learning_rate": 1.783074043365339e-05,
1094
+ "loss": 0.2898,
1095
+ "step": 775
1096
+ },
1097
+ {
1098
+ "epoch": 2.357034795763994,
1099
+ "grad_norm": 1.1434799432754517,
1100
+ "learning_rate": 1.7683409599433716e-05,
1101
+ "loss": 0.2645,
1102
+ "step": 780
1103
+ },
1104
+ {
1105
+ "epoch": 2.3721633888048412,
1106
+ "grad_norm": 1.1078280210494995,
1107
+ "learning_rate": 1.7535810837260996e-05,
1108
+ "loss": 0.2317,
1109
+ "step": 785
1110
+ },
1111
+ {
1112
+ "epoch": 2.3872919818456886,
1113
+ "grad_norm": 1.0709105730056763,
1114
+ "learning_rate": 1.7387958884294325e-05,
1115
+ "loss": 0.2591,
1116
+ "step": 790
1117
+ },
1118
+ {
1119
+ "epoch": 2.4024205748865355,
1120
+ "grad_norm": 1.1712156534194946,
1121
+ "learning_rate": 1.723986850297293e-05,
1122
+ "loss": 0.2459,
1123
+ "step": 795
1124
+ },
1125
+ {
1126
+ "epoch": 2.417549167927383,
1127
+ "grad_norm": 1.2031867504119873,
1128
+ "learning_rate": 1.7091554479542172e-05,
1129
+ "loss": 0.2292,
1130
+ "step": 800
1131
+ },
1132
+ {
1133
+ "epoch": 2.4326777609682297,
1134
+ "grad_norm": 1.15472412109375,
1135
+ "learning_rate": 1.6943031622577197e-05,
1136
+ "loss": 0.2307,
1137
+ "step": 805
1138
+ },
1139
+ {
1140
+ "epoch": 2.447806354009077,
1141
+ "grad_norm": 1.0374424457550049,
1142
+ "learning_rate": 1.6794314761504362e-05,
1143
+ "loss": 0.2286,
1144
+ "step": 810
1145
+ },
1146
+ {
1147
+ "epoch": 2.4629349470499244,
1148
+ "grad_norm": 1.168131947517395,
1149
+ "learning_rate": 1.6645418745120583e-05,
1150
+ "loss": 0.2454,
1151
+ "step": 815
1152
+ },
1153
+ {
1154
+ "epoch": 2.478063540090772,
1155
+ "grad_norm": 1.1437524557113647,
1156
+ "learning_rate": 1.6496358440110725e-05,
1157
+ "loss": 0.2301,
1158
+ "step": 820
1159
+ },
1160
+ {
1161
+ "epoch": 2.4931921331316187,
1162
+ "grad_norm": 1.0511492490768433,
1163
+ "learning_rate": 1.6347148729563236e-05,
1164
+ "loss": 0.2301,
1165
+ "step": 825
1166
+ },
1167
+ {
1168
+ "epoch": 2.508320726172466,
1169
+ "grad_norm": 1.3130929470062256,
1170
+ "learning_rate": 1.6197804511484115e-05,
1171
+ "loss": 0.2288,
1172
+ "step": 830
1173
+ },
1174
+ {
1175
+ "epoch": 2.523449319213313,
1176
+ "grad_norm": 1.0749022960662842,
1177
+ "learning_rate": 1.604834069730942e-05,
1178
+ "loss": 0.2339,
1179
+ "step": 835
1180
+ },
1181
+ {
1182
+ "epoch": 2.5385779122541603,
1183
+ "grad_norm": 1.1725904941558838,
1184
+ "learning_rate": 1.589877221041641e-05,
1185
+ "loss": 0.2486,
1186
+ "step": 840
1187
+ },
1188
+ {
1189
+ "epoch": 2.5537065052950076,
1190
+ "grad_norm": 1.171730875968933,
1191
+ "learning_rate": 1.5749113984633504e-05,
1192
+ "loss": 0.2254,
1193
+ "step": 845
1194
+ },
1195
+ {
1196
+ "epoch": 2.568835098335855,
1197
+ "grad_norm": 1.1446218490600586,
1198
+ "learning_rate": 1.5599380962749188e-05,
1199
+ "loss": 0.2028,
1200
+ "step": 850
1201
+ },
1202
+ {
1203
+ "epoch": 2.583963691376702,
1204
+ "grad_norm": 1.14954674243927,
1205
+ "learning_rate": 1.5449588095020064e-05,
1206
+ "loss": 0.2465,
1207
+ "step": 855
1208
+ },
1209
+ {
1210
+ "epoch": 2.5990922844175492,
1211
+ "grad_norm": 1.1885757446289062,
1212
+ "learning_rate": 1.5299750337678096e-05,
1213
+ "loss": 0.2285,
1214
+ "step": 860
1215
+ },
1216
+ {
1217
+ "epoch": 2.614220877458396,
1218
+ "grad_norm": 1.0606110095977783,
1219
+ "learning_rate": 1.514988265143731e-05,
1220
+ "loss": 0.2075,
1221
+ "step": 865
1222
+ },
1223
+ {
1224
+ "epoch": 2.6293494704992435,
1225
+ "grad_norm": 1.3303227424621582,
1226
+ "learning_rate": 1.5e-05,
1227
+ "loss": 0.2273,
1228
+ "step": 870
1229
+ },
1230
+ {
1231
+ "epoch": 2.644478063540091,
1232
+ "grad_norm": 0.9864528179168701,
1233
+ "learning_rate": 1.4850117348562696e-05,
1234
+ "loss": 0.2005,
1235
+ "step": 875
1236
+ },
1237
+ {
1238
+ "epoch": 2.659606656580938,
1239
+ "grad_norm": 1.1116912364959717,
1240
+ "learning_rate": 1.4700249662321903e-05,
1241
+ "loss": 0.2165,
1242
+ "step": 880
1243
+ },
1244
+ {
1245
+ "epoch": 2.674735249621785,
1246
+ "grad_norm": 1.2655029296875,
1247
+ "learning_rate": 1.4550411904979939e-05,
1248
+ "loss": 0.2307,
1249
+ "step": 885
1250
+ },
1251
+ {
1252
+ "epoch": 2.6898638426626325,
1253
+ "grad_norm": 1.167178750038147,
1254
+ "learning_rate": 1.440061903725082e-05,
1255
+ "loss": 0.2081,
1256
+ "step": 890
1257
+ },
1258
+ {
1259
+ "epoch": 2.7049924357034794,
1260
+ "grad_norm": 1.1951887607574463,
1261
+ "learning_rate": 1.4250886015366502e-05,
1262
+ "loss": 0.186,
1263
+ "step": 895
1264
+ },
1265
+ {
1266
+ "epoch": 2.7201210287443267,
1267
+ "grad_norm": 1.2827143669128418,
1268
+ "learning_rate": 1.4101227789583594e-05,
1269
+ "loss": 0.1999,
1270
+ "step": 900
1271
+ },
1272
+ {
1273
+ "epoch": 2.735249621785174,
1274
+ "grad_norm": 1.1490733623504639,
1275
+ "learning_rate": 1.3951659302690578e-05,
1276
+ "loss": 0.2209,
1277
+ "step": 905
1278
+ },
1279
+ {
1280
+ "epoch": 2.7503782148260214,
1281
+ "grad_norm": 1.126470685005188,
1282
+ "learning_rate": 1.3802195488515886e-05,
1283
+ "loss": 0.2234,
1284
+ "step": 910
1285
+ },
1286
+ {
1287
+ "epoch": 2.7655068078668683,
1288
+ "grad_norm": 1.1791654825210571,
1289
+ "learning_rate": 1.3652851270436768e-05,
1290
+ "loss": 0.2272,
1291
+ "step": 915
1292
+ },
1293
+ {
1294
+ "epoch": 2.7806354009077157,
1295
+ "grad_norm": 1.1004488468170166,
1296
+ "learning_rate": 1.3503641559889274e-05,
1297
+ "loss": 0.1967,
1298
+ "step": 920
1299
+ },
1300
+ {
1301
+ "epoch": 2.7957639939485626,
1302
+ "grad_norm": 1.0978676080703735,
1303
+ "learning_rate": 1.335458125487942e-05,
1304
+ "loss": 0.2102,
1305
+ "step": 925
1306
+ },
1307
+ {
1308
+ "epoch": 2.81089258698941,
1309
+ "grad_norm": 1.3805896043777466,
1310
+ "learning_rate": 1.3205685238495642e-05,
1311
+ "loss": 0.1822,
1312
+ "step": 930
1313
+ },
1314
+ {
1315
+ "epoch": 2.8260211800302573,
1316
+ "grad_norm": 1.0566489696502686,
1317
+ "learning_rate": 1.3056968377422804e-05,
1318
+ "loss": 0.1914,
1319
+ "step": 935
1320
+ },
1321
+ {
1322
+ "epoch": 2.8411497730711046,
1323
+ "grad_norm": 1.0609066486358643,
1324
+ "learning_rate": 1.2908445520457832e-05,
1325
+ "loss": 0.2089,
1326
+ "step": 940
1327
+ },
1328
+ {
1329
+ "epoch": 2.8562783661119515,
1330
+ "grad_norm": 1.1454850435256958,
1331
+ "learning_rate": 1.2760131497027073e-05,
1332
+ "loss": 0.1931,
1333
+ "step": 945
1334
+ },
1335
+ {
1336
+ "epoch": 2.871406959152799,
1337
+ "grad_norm": 1.126051902770996,
1338
+ "learning_rate": 1.2612041115705679e-05,
1339
+ "loss": 0.2069,
1340
+ "step": 950
1341
+ },
1342
+ {
1343
+ "epoch": 2.8865355521936458,
1344
+ "grad_norm": 1.1702042818069458,
1345
+ "learning_rate": 1.2464189162739012e-05,
1346
+ "loss": 0.1888,
1347
+ "step": 955
1348
+ },
1349
+ {
1350
+ "epoch": 2.901664145234493,
1351
+ "grad_norm": 1.1732772588729858,
1352
+ "learning_rate": 1.2316590400566286e-05,
1353
+ "loss": 0.1928,
1354
+ "step": 960
1355
+ },
1356
+ {
1357
+ "epoch": 2.9167927382753405,
1358
+ "grad_norm": 1.3314909934997559,
1359
+ "learning_rate": 1.2169259566346612e-05,
1360
+ "loss": 0.2029,
1361
+ "step": 965
1362
+ },
1363
+ {
1364
+ "epoch": 2.931921331316188,
1365
+ "grad_norm": 1.1199489831924438,
1366
+ "learning_rate": 1.2022211370487546e-05,
1367
+ "loss": 0.1871,
1368
+ "step": 970
1369
+ },
1370
+ {
1371
+ "epoch": 2.9470499243570347,
1372
+ "grad_norm": 1.098125696182251,
1373
+ "learning_rate": 1.1875460495176297e-05,
1374
+ "loss": 0.1638,
1375
+ "step": 975
1376
+ },
1377
+ {
1378
+ "epoch": 2.962178517397882,
1379
+ "grad_norm": 1.1100715398788452,
1380
+ "learning_rate": 1.1729021592913791e-05,
1381
+ "loss": 0.2053,
1382
+ "step": 980
1383
+ },
1384
+ {
1385
+ "epoch": 2.977307110438729,
1386
+ "grad_norm": 1.1177160739898682,
1387
+ "learning_rate": 1.1582909285051664e-05,
1388
+ "loss": 0.1992,
1389
+ "step": 985
1390
+ },
1391
+ {
1392
+ "epoch": 2.9924357034795763,
1393
+ "grad_norm": 1.1357407569885254,
1394
+ "learning_rate": 1.1437138160332371e-05,
1395
+ "loss": 0.1834,
1396
+ "step": 990
1397
+ },
1398
+ {
1399
+ "epoch": 3.006051437216339,
1400
+ "grad_norm": 1.0767449140548706,
1401
+ "learning_rate": 1.1291722773432571e-05,
1402
+ "loss": 0.1724,
1403
+ "step": 995
1404
+ },
1405
+ {
1406
+ "epoch": 3.0211800302571863,
1407
+ "grad_norm": 1.0442917346954346,
1408
+ "learning_rate": 1.1146677643509893e-05,
1409
+ "loss": 0.1403,
1410
+ "step": 1000
1411
+ },
1412
+ {
1413
+ "epoch": 3.036308623298033,
1414
+ "grad_norm": 1.0078842639923096,
1415
+ "learning_rate": 1.100201725275326e-05,
1416
+ "loss": 0.1385,
1417
+ "step": 1005
1418
+ },
1419
+ {
1420
+ "epoch": 3.0514372163388805,
1421
+ "grad_norm": 0.955449104309082,
1422
+ "learning_rate": 1.0857756044936876e-05,
1423
+ "loss": 0.1407,
1424
+ "step": 1010
1425
+ },
1426
+ {
1427
+ "epoch": 3.066565809379728,
1428
+ "grad_norm": 1.0730222463607788,
1429
+ "learning_rate": 1.0713908423978111e-05,
1430
+ "loss": 0.1367,
1431
+ "step": 1015
1432
+ },
1433
+ {
1434
+ "epoch": 3.081694402420575,
1435
+ "grad_norm": 1.2181591987609863,
1436
+ "learning_rate": 1.0570488752499293e-05,
1437
+ "loss": 0.1622,
1438
+ "step": 1020
1439
+ },
1440
+ {
1441
+ "epoch": 3.096822995461422,
1442
+ "grad_norm": 1.12200129032135,
1443
+ "learning_rate": 1.042751135039367e-05,
1444
+ "loss": 0.1337,
1445
+ "step": 1025
1446
+ },
1447
+ {
1448
+ "epoch": 3.1119515885022695,
1449
+ "grad_norm": 1.044112205505371,
1450
+ "learning_rate": 1.028499049339562e-05,
1451
+ "loss": 0.1385,
1452
+ "step": 1030
1453
+ },
1454
+ {
1455
+ "epoch": 3.1270801815431164,
1456
+ "grad_norm": 1.2044527530670166,
1457
+ "learning_rate": 1.014294041165529e-05,
1458
+ "loss": 0.1536,
1459
+ "step": 1035
1460
+ },
1461
+ {
1462
+ "epoch": 3.1422087745839637,
1463
+ "grad_norm": 0.9035820364952087,
1464
+ "learning_rate": 1.000137528831774e-05,
1465
+ "loss": 0.1229,
1466
+ "step": 1040
1467
+ },
1468
+ {
1469
+ "epoch": 3.157337367624811,
1470
+ "grad_norm": 1.0843486785888672,
1471
+ "learning_rate": 9.860309258106854e-06,
1472
+ "loss": 0.1419,
1473
+ "step": 1045
1474
+ },
1475
+ {
1476
+ "epoch": 3.172465960665658,
1477
+ "grad_norm": 1.26431405544281,
1478
+ "learning_rate": 9.719756405913997e-06,
1479
+ "loss": 0.1414,
1480
+ "step": 1050
1481
+ },
1482
+ {
1483
+ "epoch": 3.1875945537065054,
1484
+ "grad_norm": 1.0138827562332153,
1485
+ "learning_rate": 9.57973076539172e-06,
1486
+ "loss": 0.138,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 3.2027231467473527,
1491
+ "grad_norm": 1.0034918785095215,
1492
+ "learning_rate": 9.440246317552568e-06,
1493
+ "loss": 0.1295,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 3.2178517397881996,
1498
+ "grad_norm": 1.10910165309906,
1499
+ "learning_rate": 9.301316989373092e-06,
1500
+ "loss": 0.1449,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 3.232980332829047,
1505
+ "grad_norm": 0.9604946970939636,
1506
+ "learning_rate": 9.162956652403324e-06,
1507
+ "loss": 0.1227,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 3.2481089258698943,
1512
+ "grad_norm": 1.1060423851013184,
1513
+ "learning_rate": 9.025179121381763e-06,
1514
+ "loss": 0.1541,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 3.263237518910741,
1519
+ "grad_norm": 1.052452802658081,
1520
+ "learning_rate": 8.88799815285599e-06,
1521
+ "loss": 0.1356,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 3.2783661119515886,
1526
+ "grad_norm": 1.1046732664108276,
1527
+ "learning_rate": 8.751427443809163e-06,
1528
+ "loss": 0.1313,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 3.293494704992436,
1533
+ "grad_norm": 1.0501445531845093,
1534
+ "learning_rate": 8.615480630292426e-06,
1535
+ "loss": 0.1194,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 3.308623298033283,
1540
+ "grad_norm": 1.0697016716003418,
1541
+ "learning_rate": 8.480171286063389e-06,
1542
+ "loss": 0.1469,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 3.32375189107413,
1547
+ "grad_norm": 1.036516785621643,
1548
+ "learning_rate": 8.34551292123085e-06,
1549
+ "loss": 0.1287,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 3.338880484114977,
1554
+ "grad_norm": 0.9603574872016907,
1555
+ "learning_rate": 8.211518980905842e-06,
1556
+ "loss": 0.1261,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 3.3540090771558244,
1561
+ "grad_norm": 1.0846987962722778,
1562
+ "learning_rate": 8.078202843859234e-06,
1563
+ "loss": 0.1188,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 3.3691376701966718,
1568
+ "grad_norm": 1.0942096710205078,
1569
+ "learning_rate": 7.94557782118588e-06,
1570
+ "loss": 0.1252,
1571
+ "step": 1115
1572
+ },
1573
+ {
1574
+ "epoch": 3.384266263237519,
1575
+ "grad_norm": 1.0738424062728882,
1576
+ "learning_rate": 7.813657154975566e-06,
1577
+ "loss": 0.1358,
1578
+ "step": 1120
1579
+ },
1580
+ {
1581
+ "epoch": 3.399394856278366,
1582
+ "grad_norm": 1.0699353218078613,
1583
+ "learning_rate": 7.682454016990836e-06,
1584
+ "loss": 0.1179,
1585
+ "step": 1125
1586
+ },
1587
+ {
1588
+ "epoch": 3.4145234493192134,
1589
+ "grad_norm": 1.213776707649231,
1590
+ "learning_rate": 7.5519815073518615e-06,
1591
+ "loss": 0.1243,
1592
+ "step": 1130
1593
+ },
1594
+ {
1595
+ "epoch": 3.4296520423600603,
1596
+ "grad_norm": 1.0862241983413696,
1597
+ "learning_rate": 7.422252653228401e-06,
1598
+ "loss": 0.1322,
1599
+ "step": 1135
1600
+ },
1601
+ {
1602
+ "epoch": 3.4447806354009076,
1603
+ "grad_norm": 0.9266173243522644,
1604
+ "learning_rate": 7.293280407539161e-06,
1605
+ "loss": 0.127,
1606
+ "step": 1140
1607
+ },
1608
+ {
1609
+ "epoch": 3.459909228441755,
1610
+ "grad_norm": 0.983642041683197,
1611
+ "learning_rate": 7.16507764765842e-06,
1612
+ "loss": 0.1192,
1613
+ "step": 1145
1614
+ },
1615
+ {
1616
+ "epoch": 3.4750378214826023,
1617
+ "grad_norm": 0.9537131786346436,
1618
+ "learning_rate": 7.037657174130322e-06,
1619
+ "loss": 0.1317,
1620
+ "step": 1150
1621
+ },
1622
+ {
1623
+ "epoch": 3.4901664145234492,
1624
+ "grad_norm": 1.0515652894973755,
1625
+ "learning_rate": 6.911031709390778e-06,
1626
+ "loss": 0.1164,
1627
+ "step": 1155
1628
+ },
1629
+ {
1630
+ "epoch": 3.5052950075642966,
1631
+ "grad_norm": 1.1290754079818726,
1632
+ "learning_rate": 6.785213896497187e-06,
1633
+ "loss": 0.1009,
1634
+ "step": 1160
1635
+ },
1636
+ {
1637
+ "epoch": 3.5204236006051435,
1638
+ "grad_norm": 1.0471539497375488,
1639
+ "learning_rate": 6.660216297866044e-06,
1640
+ "loss": 0.1202,
1641
+ "step": 1165
1642
+ },
1643
+ {
1644
+ "epoch": 3.535552193645991,
1645
+ "grad_norm": 1.033564567565918,
1646
+ "learning_rate": 6.536051394018702e-06,
1647
+ "loss": 0.128,
1648
+ "step": 1170
1649
+ },
1650
+ {
1651
+ "epoch": 3.550680786686838,
1652
+ "grad_norm": 1.1593022346496582,
1653
+ "learning_rate": 6.412731582335146e-06,
1654
+ "loss": 0.1196,
1655
+ "step": 1175
1656
+ },
1657
+ {
1658
+ "epoch": 3.5658093797276855,
1659
+ "grad_norm": 1.0917978286743164,
1660
+ "learning_rate": 6.290269175816268e-06,
1661
+ "loss": 0.113,
1662
+ "step": 1180
1663
+ },
1664
+ {
1665
+ "epoch": 3.5809379727685324,
1666
+ "grad_norm": 1.2841700315475464,
1667
+ "learning_rate": 6.168676401854357e-06,
1668
+ "loss": 0.1239,
1669
+ "step": 1185
1670
+ },
1671
+ {
1672
+ "epoch": 3.59606656580938,
1673
+ "grad_norm": 1.0042773485183716,
1674
+ "learning_rate": 6.047965401012324e-06,
1675
+ "loss": 0.118,
1676
+ "step": 1190
1677
+ },
1678
+ {
1679
+ "epoch": 3.6111951588502267,
1680
+ "grad_norm": 0.8104374408721924,
1681
+ "learning_rate": 5.92814822581147e-06,
1682
+ "loss": 0.1183,
1683
+ "step": 1195
1684
+ },
1685
+ {
1686
+ "epoch": 3.626323751891074,
1687
+ "grad_norm": 1.1587086915969849,
1688
+ "learning_rate": 5.809236839528115e-06,
1689
+ "loss": 0.1165,
1690
+ "step": 1200
1691
+ },
1692
+ {
1693
+ "epoch": 3.6414523449319214,
1694
+ "grad_norm": 1.0012660026550293,
1695
+ "learning_rate": 5.6912431149990704e-06,
1696
+ "loss": 0.111,
1697
+ "step": 1205
1698
+ },
1699
+ {
1700
+ "epoch": 3.6565809379727687,
1701
+ "grad_norm": 0.9174544215202332,
1702
+ "learning_rate": 5.5741788334362505e-06,
1703
+ "loss": 0.111,
1704
+ "step": 1210
1705
+ },
1706
+ {
1707
+ "epoch": 3.6717095310136156,
1708
+ "grad_norm": 0.8950276374816895,
1709
+ "learning_rate": 5.458055683250288e-06,
1710
+ "loss": 0.1189,
1711
+ "step": 1215
1712
+ },
1713
+ {
1714
+ "epoch": 3.686838124054463,
1715
+ "grad_norm": 0.8682631850242615,
1716
+ "learning_rate": 5.342885258883548e-06,
1717
+ "loss": 0.1212,
1718
+ "step": 1220
1719
+ },
1720
+ {
1721
+ "epoch": 3.70196671709531,
1722
+ "grad_norm": 1.0850940942764282,
1723
+ "learning_rate": 5.228679059652446e-06,
1724
+ "loss": 0.1148,
1725
+ "step": 1225
1726
+ },
1727
+ {
1728
+ "epoch": 3.7170953101361572,
1729
+ "grad_norm": 1.039988398551941,
1730
+ "learning_rate": 5.115448488599287e-06,
1731
+ "loss": 0.1206,
1732
+ "step": 1230
1733
+ },
1734
+ {
1735
+ "epoch": 3.7322239031770046,
1736
+ "grad_norm": 0.8652891516685486,
1737
+ "learning_rate": 5.003204851353719e-06,
1738
+ "loss": 0.1115,
1739
+ "step": 1235
1740
+ },
1741
+ {
1742
+ "epoch": 3.747352496217852,
1743
+ "grad_norm": 0.9648194313049316,
1744
+ "learning_rate": 4.891959355003916e-06,
1745
+ "loss": 0.1104,
1746
+ "step": 1240
1747
+ },
1748
+ {
1749
+ "epoch": 3.762481089258699,
1750
+ "grad_norm": 0.8316813111305237,
1751
+ "learning_rate": 4.781723106977564e-06,
1752
+ "loss": 0.1154,
1753
+ "step": 1245
1754
+ },
1755
+ {
1756
+ "epoch": 3.777609682299546,
1757
+ "grad_norm": 0.9703608155250549,
1758
+ "learning_rate": 4.672507113932888e-06,
1759
+ "loss": 0.1208,
1760
+ "step": 1250
1761
+ },
1762
+ {
1763
+ "epoch": 3.792738275340393,
1764
+ "grad_norm": 0.9646096229553223,
1765
+ "learning_rate": 4.564322280659612e-06,
1766
+ "loss": 0.1031,
1767
+ "step": 1255
1768
+ },
1769
+ {
1770
+ "epoch": 3.8078668683812404,
1771
+ "grad_norm": 0.8670827746391296,
1772
+ "learning_rate": 4.457179408990203e-06,
1773
+ "loss": 0.104,
1774
+ "step": 1260
1775
+ },
1776
+ {
1777
+ "epoch": 3.822995461422088,
1778
+ "grad_norm": 0.9630376696586609,
1779
+ "learning_rate": 4.35108919672134e-06,
1780
+ "loss": 0.1069,
1781
+ "step": 1265
1782
+ },
1783
+ {
1784
+ "epoch": 3.838124054462935,
1785
+ "grad_norm": 0.9716989994049072,
1786
+ "learning_rate": 4.246062236545771e-06,
1787
+ "loss": 0.1154,
1788
+ "step": 1270
1789
+ },
1790
+ {
1791
+ "epoch": 3.853252647503782,
1792
+ "grad_norm": 0.8012140989303589,
1793
+ "learning_rate": 4.142109014994685e-06,
1794
+ "loss": 0.1086,
1795
+ "step": 1275
1796
+ },
1797
+ {
1798
+ "epoch": 3.8683812405446294,
1799
+ "grad_norm": 0.9220558404922485,
1800
+ "learning_rate": 4.0392399113906735e-06,
1801
+ "loss": 0.1132,
1802
+ "step": 1280
1803
+ },
1804
+ {
1805
+ "epoch": 3.8835098335854763,
1806
+ "grad_norm": 0.9726728796958923,
1807
+ "learning_rate": 3.937465196811375e-06,
1808
+ "loss": 0.1108,
1809
+ "step": 1285
1810
+ },
1811
+ {
1812
+ "epoch": 3.8986384266263236,
1813
+ "grad_norm": 0.9615852236747742,
1814
+ "learning_rate": 3.836795033063982e-06,
1815
+ "loss": 0.1134,
1816
+ "step": 1290
1817
+ },
1818
+ {
1819
+ "epoch": 3.913767019667171,
1820
+ "grad_norm": 0.8726525902748108,
1821
+ "learning_rate": 3.7372394716706e-06,
1822
+ "loss": 0.1106,
1823
+ "step": 1295
1824
+ },
1825
+ {
1826
+ "epoch": 3.9288956127080183,
1827
+ "grad_norm": 1.0038763284683228,
1828
+ "learning_rate": 3.638808452864646e-06,
1829
+ "loss": 0.1073,
1830
+ "step": 1300
1831
+ },
1832
+ {
1833
+ "epoch": 3.9440242057488653,
1834
+ "grad_norm": 0.9384374618530273,
1835
+ "learning_rate": 3.5415118045983635e-06,
1836
+ "loss": 0.1091,
1837
+ "step": 1305
1838
+ },
1839
+ {
1840
+ "epoch": 3.9591527987897126,
1841
+ "grad_norm": 0.9748557209968567,
1842
+ "learning_rate": 3.4453592415615336e-06,
1843
+ "loss": 0.1147,
1844
+ "step": 1310
1845
+ },
1846
+ {
1847
+ "epoch": 3.9742813918305595,
1848
+ "grad_norm": 0.939871072769165,
1849
+ "learning_rate": 3.350360364211494e-06,
1850
+ "loss": 0.1132,
1851
+ "step": 1315
1852
+ },
1853
+ {
1854
+ "epoch": 3.989409984871407,
1855
+ "grad_norm": 0.8852137327194214,
1856
+ "learning_rate": 3.256524657814588e-06,
1857
+ "loss": 0.1022,
1858
+ "step": 1320
1859
+ }
1860
+ ],
1861
+ "logging_steps": 5,
1862
+ "max_steps": 1655,
1863
+ "num_input_tokens_seen": 0,
1864
+ "num_train_epochs": 5,
1865
+ "save_steps": 2000,
1866
+ "stateful_callbacks": {
1867
+ "TrainerControl": {
1868
+ "args": {
1869
+ "should_epoch_stop": false,
1870
+ "should_evaluate": false,
1871
+ "should_log": false,
1872
+ "should_save": true,
1873
+ "should_training_stop": false
1874
+ },
1875
+ "attributes": {}
1876
+ }
1877
+ },
1878
+ "total_flos": 1.9866744375748854e+18,
1879
+ "train_batch_size": 2,
1880
+ "trial_name": null,
1881
+ "trial_params": null
1882
+ }
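The trainer_state.json above logs one entry every 5 steps (`logging_steps`), each carrying `epoch`, `grad_norm`, `learning_rate`, `loss` and `step`; the loss drops from roughly 0.23 around epoch 2.5 to roughly 0.10 by epoch 4, and the learning-rate values are consistent with a cosine-style decay from the 3e-5 peak implied by the run name. A minimal sketch for pulling those curves out of the file (the local path is illustrative):

```python
import json

# Illustrative path to the checkpoint shown in this commit.
state_path = "20_128_e5_3e-5/checkpoint-1324/trainer_state.json"

with open(state_path) as f:
    state = json.load(f)

# Keep only the periodic training logs (one per `logging_steps` = 5 steps).
history = [entry for entry in state["log_history"] if "loss" in entry]
steps = [entry["step"] for entry in history]
losses = [entry["loss"] for entry in history]
lrs = [entry["learning_rate"] for entry in history]

print(f"{len(history)} logged points; last step {steps[-1]}, "
      f"loss {losses[-1]:.4f}, lr {lrs[-1]:.2e}")
```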
20_128_e5_3e-5/checkpoint-1324/training_args.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2fdd8e6403af008b87810dc8e1147b217350447ed548cf0396b8b43910b391a1
3
+ size 7736
20_128_e5_3e-5/checkpoint-1324/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1324/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from a zero 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
25
+ # DeepSpeed data structures it has to be available in the current python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict()
35
+ param_shapes: dict()
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict()
39
+ frozen_param_fragments: dict()
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
62
+ if not os.path.isdir(checkpoint_dir):
63
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
64
+
65
+ # there should be only one file
66
+ if zero_stage <= 2:
67
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
68
+ elif zero_stage == 3:
69
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
70
+
71
+ if not os.path.exists(file):
72
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
73
+
74
+ return file
75
+
76
+
77
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
78
+ # XXX: need to test that this simple glob rule works for multi-node setup too
79
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
80
+
81
+ if len(ckpt_files) == 0:
82
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
83
+
84
+ return ckpt_files
85
+
86
+
87
+ def get_optim_files(checkpoint_dir):
88
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
89
+
90
+
91
+ def get_model_state_files(checkpoint_dir):
92
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
93
+
94
+
95
+ def parse_model_states(files):
96
+ zero_model_states = []
97
+ for file in files:
98
+ state_dict = torch.load(file, map_location=device)
99
+
100
+ if BUFFER_NAMES not in state_dict:
101
+ raise ValueError(f"{file} is not a model state checkpoint")
102
+ buffer_names = state_dict[BUFFER_NAMES]
103
+ if debug:
104
+ print("Found buffers:", buffer_names)
105
+
106
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
107
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
108
+ param_shapes = state_dict[PARAM_SHAPES]
109
+
110
+ # collect parameters that are included in param_shapes
111
+ param_names = []
112
+ for s in param_shapes:
113
+ for name in s.keys():
114
+ param_names.append(name)
115
+
116
+ # update with frozen parameters
117
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
118
+ if frozen_param_shapes is not None:
119
+ if debug:
120
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
121
+ param_names += list(frozen_param_shapes.keys())
122
+
123
+ # handle shared params
124
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
125
+
126
+ ds_version = state_dict.get(DS_VERSION, None)
127
+
128
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
129
+
130
+ z_model_state = zero_model_state(buffers=buffers,
131
+ param_shapes=param_shapes,
132
+ shared_params=shared_params,
133
+ ds_version=ds_version,
134
+ frozen_param_shapes=frozen_param_shapes,
135
+ frozen_param_fragments=frozen_param_fragments)
136
+ zero_model_states.append(z_model_state)
137
+
138
+ return zero_model_states
139
+
140
+
141
+ def parse_optim_states(files, ds_checkpoint_dir):
142
+
143
+ total_files = len(files)
144
+ state_dicts = []
145
+ for f in files:
146
+ state_dict = torch.load(f, map_location=device)
147
+ # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
148
+ # and also handle the case where it was already removed by another helper script
149
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
150
+ state_dicts.append(state_dict)
151
+
152
+ if not ZERO_STAGE in state_dicts[0][OPTIMIZER_STATE_DICT]:
153
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
154
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
155
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
156
+
157
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
158
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
159
+ # use the max of the partition_count to get the dp world_size.
160
+
161
+ if type(world_size) is list:
162
+ world_size = max(world_size)
163
+
164
+ if world_size != total_files:
165
+ raise ValueError(
166
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
167
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
168
+ )
169
+
170
+ # the groups are named differently in each stage
171
+ if zero_stage <= 2:
172
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
173
+ elif zero_stage == 3:
174
+ fp32_groups_key = FP32_FLAT_GROUPS
175
+ else:
176
+ raise ValueError(f"unknown zero stage {zero_stage}")
177
+
178
+ if zero_stage <= 2:
179
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
180
+ elif zero_stage == 3:
181
+ # if there is more than one param group, there will be multiple flattened tensors - one
182
+ # flattened tensor per group - for simplicity merge them into a single tensor
183
+ #
184
+ # XXX: could make the script more memory efficient for when there are multiple groups - it
185
+ # will require matching the sub-lists of param_shapes for each param group flattened tensor
186
+
187
+ fp32_flat_groups = [
188
+ torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
189
+ ]
190
+
191
+ return zero_stage, world_size, fp32_flat_groups
192
+
193
+
194
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
195
+ """
196
+ Returns fp32 state_dict reconstructed from ds checkpoint
197
+
198
+ Args:
199
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
200
+
201
+ """
202
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
203
+
204
+ optim_files = get_optim_files(ds_checkpoint_dir)
205
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
206
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
207
+
208
+ model_files = get_model_state_files(ds_checkpoint_dir)
209
+
210
+ zero_model_states = parse_model_states(model_files)
211
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
212
+
213
+ if zero_stage <= 2:
214
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
215
+ exclude_frozen_parameters)
216
+ elif zero_stage == 3:
217
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
218
+ exclude_frozen_parameters)
219
+
220
+
221
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
222
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
223
+ return
224
+
225
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
226
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
227
+
228
+ if debug:
229
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
230
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
231
+
232
+ wanted_params = len(frozen_param_shapes)
233
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
234
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
235
+ print(f'Frozen params: Have {avail_numel} numels to process.')
236
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
237
+
238
+ total_params = 0
239
+ total_numel = 0
240
+ for name, shape in frozen_param_shapes.items():
241
+ total_params += 1
242
+ unpartitioned_numel = shape.numel()
243
+ total_numel += unpartitioned_numel
244
+
245
+ state_dict[name] = frozen_param_fragments[name]
246
+
247
+ if debug:
248
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
249
+
250
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
251
+
252
+
253
+ def _has_callable(obj, fn):
254
+ attr = getattr(obj, fn, None)
255
+ return callable(attr)
256
+
257
+
258
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
259
+ param_shapes = zero_model_states[0].param_shapes
260
+
261
+ # Reconstruction protocol:
262
+ #
263
+ # XXX: document this
264
+
265
+ if debug:
266
+ for i in range(world_size):
267
+ for j in range(len(fp32_flat_groups[0])):
268
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
269
+
270
+ # XXX: memory usage doubles here (zero2)
271
+ num_param_groups = len(fp32_flat_groups[0])
272
+ merged_single_partition_of_fp32_groups = []
273
+ for i in range(num_param_groups):
274
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
275
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
276
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
277
+ avail_numel = sum(
278
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
279
+
280
+ if debug:
281
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
282
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
283
+ # not asserting if there is a mismatch due to possible padding
284
+ print(f"Have {avail_numel} numels to process.")
285
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
286
+
287
+ # params
288
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
289
+ # out-of-core computing solution
290
+ total_numel = 0
291
+ total_params = 0
292
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
293
+ offset = 0
294
+ avail_numel = full_single_fp32_vector.numel()
295
+ for name, shape in shapes.items():
296
+
297
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
298
+ total_numel += unpartitioned_numel
299
+ total_params += 1
300
+
301
+ if debug:
302
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
303
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
304
+ offset += unpartitioned_numel
305
+
306
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
307
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
308
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
309
+ # live optimizer object, so we are checking that the numbers are within the right range
310
+ align_to = 2 * world_size
311
+
312
+ def zero2_align(x):
313
+ return align_to * math.ceil(x / align_to)
314
+
315
+ if debug:
316
+ print(f"original offset={offset}, avail_numel={avail_numel}")
317
+
318
+ offset = zero2_align(offset)
319
+ avail_numel = zero2_align(avail_numel)
320
+
321
+ if debug:
322
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
323
+
324
+ # Sanity check
325
+ if offset != avail_numel:
326
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
327
+
328
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
329
+
330
+
331
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
332
+ exclude_frozen_parameters):
333
+ state_dict = OrderedDict()
334
+
335
+ # buffers
336
+ buffers = zero_model_states[0].buffers
337
+ state_dict.update(buffers)
338
+ if debug:
339
+ print(f"added {len(buffers)} buffers")
340
+
341
+ if not exclude_frozen_parameters:
342
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
343
+
344
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
345
+
346
+ # recover shared parameters
347
+ for pair in zero_model_states[0].shared_params:
348
+ if pair[1] in state_dict:
349
+ state_dict[pair[0]] = state_dict[pair[1]]
350
+
351
+ return state_dict
352
+
353
+
354
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
355
+ remainder = unpartitioned_numel % world_size
356
+ padding_numel = (world_size - remainder) if remainder else 0
357
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
358
+ return partitioned_numel, padding_numel
359
+
360
+
361
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
362
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
363
+ return
364
+
365
+ if debug:
366
+ for i in range(world_size):
367
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
368
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
369
+
370
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
371
+ wanted_params = len(frozen_param_shapes)
372
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
373
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
374
+ print(f'Frozen params: Have {avail_numel} numels to process.')
375
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
376
+
377
+ total_params = 0
378
+ total_numel = 0
379
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
380
+ total_params += 1
381
+ unpartitioned_numel = shape.numel()
382
+ total_numel += unpartitioned_numel
383
+
384
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
385
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
386
+
387
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
388
+
389
+ if debug:
390
+ print(
391
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
392
+ )
393
+
394
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
395
+
396
+
397
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
398
+ param_shapes = zero_model_states[0].param_shapes
399
+ avail_numel = fp32_flat_groups[0].numel() * world_size
400
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
401
+ # param, re-consolidating each param, while dealing with padding if any
402
+
403
+ # merge list of dicts, preserving order
404
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
405
+
406
+ if debug:
407
+ for i in range(world_size):
408
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
409
+
410
+ wanted_params = len(param_shapes)
411
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
412
+ # not asserting if there is a mismatch due to possible padding
413
+ avail_numel = fp32_flat_groups[0].numel() * world_size
414
+ print(f"Trainable params: Have {avail_numel} numels to process.")
415
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
416
+
417
+ # params
418
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
419
+ # out-of-core computing solution
420
+ offset = 0
421
+ total_numel = 0
422
+ total_params = 0
423
+ for name, shape in param_shapes.items():
424
+
425
+ unpartitioned_numel = shape.numel()
426
+ total_numel += unpartitioned_numel
427
+ total_params += 1
428
+
429
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
430
+
431
+ if debug:
432
+ print(
433
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
434
+ )
435
+
436
+ # XXX: memory usage doubles here
437
+ state_dict[name] = torch.cat(
438
+ tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
439
+ 0).narrow(0, 0, unpartitioned_numel).view(shape)
440
+ offset += partitioned_numel
441
+
442
+ offset *= world_size
443
+
444
+ # Sanity check
445
+ if offset != avail_numel:
446
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
447
+
448
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
449
+
450
+
451
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
452
+ exclude_frozen_parameters):
453
+ state_dict = OrderedDict()
454
+
455
+ # buffers
456
+ buffers = zero_model_states[0].buffers
457
+ state_dict.update(buffers)
458
+ if debug:
459
+ print(f"added {len(buffers)} buffers")
460
+
461
+ if not exclude_frozen_parameters:
462
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
463
+
464
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
465
+
466
+ # recover shared parameters
467
+ for pair in zero_model_states[0].shared_params:
468
+ if pair[1] in state_dict:
469
+ state_dict[pair[0]] = state_dict[pair[1]]
470
+
471
+ return state_dict
472
+
473
+
474
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
475
+ """
476
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
477
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
478
+ via a model hub.
479
+
480
+ Args:
481
+ - ``checkpoint_dir``: path to the desired checkpoint folder
482
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
483
+ - ``exclude_frozen_parameters``: exclude frozen parameters
484
+
485
+ Returns:
486
+ - pytorch ``state_dict``
487
+
488
+ Note: this approach may not work if your application doesn't have sufficient free CPU memory and
489
+ you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
490
+ the checkpoint.
491
+
492
+ A typical usage might be ::
493
+
494
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
495
+ # do the training and checkpoint saving
496
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
497
+ model = model.cpu() # move to cpu
498
+ model.load_state_dict(state_dict)
499
+ # submit to model hub or save the model to share with others
500
+
501
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
502
+ application. i.e. you will need to re-initialize the deepspeed engine, since
503
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
504
+
505
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
506
+
507
+ """
508
+ if tag is None:
509
+ latest_path = os.path.join(checkpoint_dir, 'latest')
510
+ if os.path.isfile(latest_path):
511
+ with open(latest_path, 'r') as fd:
512
+ tag = fd.read().strip()
513
+ else:
514
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
515
+
516
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
517
+
518
+ if not os.path.isdir(ds_checkpoint_dir):
519
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
520
+
521
+ return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
522
+
523
+
524
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
525
+ """
526
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
527
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
528
+
529
+ Args:
530
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
531
+ - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
532
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
533
+ - ``exclude_frozen_parameters``: exclude frozen parameters
534
+ """
535
+
536
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
537
+ print(f"Saving fp32 state dict to {output_file}")
538
+ torch.save(state_dict, output_file)
539
+
540
+
541
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
542
+ """
543
+ 1. Put the provided model to cpu
544
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
545
+ 3. Load it into the provided model
546
+
547
+ Args:
548
+ - ``model``: the model object to update
549
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
550
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
551
+
552
+ Returns:
553
+ - ``model``: modified model
554
+
555
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
556
+ have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
557
+ conveniently placed for you in the checkpoint folder.
558
+
559
+ A typical usage might be ::
560
+
561
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
562
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
563
+ # submit to model hub or save the model to share with others
564
+
565
+ Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
566
+ of the same application. i.e. you will need to re-initialize the deepspeed engine, since
567
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
568
+
569
+ """
570
+ logger.info(f"Extracting fp32 weights")
571
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
572
+
573
+ logger.info(f"Overwriting model with fp32 weights")
574
+ model = model.cpu()
575
+ model.load_state_dict(state_dict, strict=False)
576
+
577
+ return model
578
+
579
+
580
+ if __name__ == "__main__":
581
+
582
+ parser = argparse.ArgumentParser()
583
+ parser.add_argument("checkpoint_dir",
584
+ type=str,
585
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
586
+ parser.add_argument(
587
+ "output_file",
588
+ type=str,
589
+ help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
590
+ parser.add_argument("-t",
591
+ "--tag",
592
+ type=str,
593
+ default=None,
594
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
595
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
596
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
597
+ args = parser.parse_args()
598
+
599
+ debug = args.debug
600
+
601
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
602
+ args.output_file,
603
+ tag=args.tag,
604
+ exclude_frozen_parameters=args.exclude_frozen_parameters)
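The bundled zero_to_fp32.py above consolidates the ZeRO-partitioned optimizer shards in each checkpoint folder into a single fp32 state_dict; note that the script itself imports deepspeed, so that package must be installed even though the extracted weights no longer depend on it. A minimal usage sketch against one of the folders in this upload (the paths and output filename are illustrative):

```python
# Run from inside a checkpoint folder, where the script is saved next to the shards:
#
#   cd 20_128_e5_3e-5/checkpoint-1655
#   python zero_to_fp32.py . pytorch_model_fp32.bin
#
# or call the same entry point from Python:
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "20_128_e5_3e-5/checkpoint-1655",  # folder containing the `latest` tag file (global_step1655)
    "granite_lora_fp32.bin",           # hypothetical output path for the consolidated state_dict
)
```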
20_128_e5_3e-5/checkpoint-1655/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: ibm-granite/granite-3.3-8b-base
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.15.2
20_128_e5_3e-5/checkpoint-1655/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 256,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 128,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "up_proj",
28
+ "v_proj",
29
+ "down_proj",
30
+ "o_proj",
31
+ "q_proj",
32
+ "k_proj",
33
+ "gate_proj"
34
+ ],
35
+ "task_type": "CAUSAL_LM",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
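The adapter_config.json above describes a rank-128 LoRA adapter (alpha 256, dropout 0.05) over the attention and MLP projections of ibm-granite/granite-3.3-8b-base. A minimal sketch of attaching it with PEFT; the local checkpoint path and the dtype choice are assumptions, not part of this upload:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "ibm-granite/granite-3.3-8b-base"      # from adapter_config.json
adapter_dir = "20_128_e5_3e-5/checkpoint-1655"   # illustrative local path to this checkpoint

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

# Attach the LoRA weights stored in adapter_model.safetensors.
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()

# Optionally fold the adapter into the base weights for adapter-free inference:
# model = model.merge_and_unload()
```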
20_128_e5_3e-5/checkpoint-1655/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fc5dd01d04f16b5d91aad09b6e54b32c700075d39d76e490f35f5e1b5e537885
3
+ size 791751704
20_128_e5_3e-5/checkpoint-1655/latest ADDED
@@ -0,0 +1 @@
 
 
1
+ global_step1655
20_128_e5_3e-5/checkpoint-1655/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1655/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de6c4e745884d96adb700b432acb6450c6a99c1ec1da087d95cbc448d3d9ec05
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:08e334d229ae6c2e761818d7eba6a6621fbce77634943310f02a01e960b70b09
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e71f24da284038a969478ffae452ab5c5daf51773064583339eb6a9d9b3b1dfa
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8128eccdeda937496f288347c0300d652afdf937b875f8aa21cdb17bd35e7c40
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d7a9a1050c3a45f0960632a36fa1c23bcfb831f2fcbfa5a4e0f024cdb995067b
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:83f08a49ef21cc3e78d928d21a5b158371e8c7ff749a74bcbc2d7272e5f2a243
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6cc16f7e417f07e28ce4e316ed669da9c42a3ee270740113c19645b327d06310
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ebab7ec50e1178ec5b79e07db8d77e197aa84eb404eacad7b42f45b1b6646b55
3
+ size 15920
20_128_e5_3e-5/checkpoint-1655/scheduler.pt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0027125075a50851d0eed0ecca1a700093c7b727f120a5dd75a9408fff6ead1d
3
+ size 1064
20_128_e5_3e-5/checkpoint-1655/special_tokens_map.json ADDED
@@ -0,0 +1,45 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|endoftext|>",
4
+ "<fim_prefix>",
5
+ "<fim_middle>",
6
+ "<fim_suffix>",
7
+ "<fim_pad>",
8
+ "<filename>",
9
+ "<gh_stars>",
10
+ "<issue_start>",
11
+ "<issue_comment>",
12
+ "<issue_closed>",
13
+ "<jupyter_start>",
14
+ "<jupyter_text>",
15
+ "<jupyter_code>",
16
+ "<jupyter_output>",
17
+ "<empty_output>",
18
+ "<commit_before>",
19
+ "<commit_msg>",
20
+ "<commit_after>",
21
+ "<reponame>"
22
+ ],
23
+ "bos_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "eos_token": {
31
+ "content": "<|endoftext|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "pad_token": "<reponame>",
38
+ "unk_token": {
39
+ "content": "<|endoftext|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ }
45
+ }
20_128_e5_3e-5/checkpoint-1655/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1655/tokenizer_config.json ADDED
@@ -0,0 +1,188 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<fim_prefix>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<fim_middle>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<fim_suffix>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<fim_pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<filename>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "<gh_stars>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "7": {
+       "content": "<issue_start>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "8": {
+       "content": "<issue_comment>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "9": {
+       "content": "<issue_closed>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "10": {
+       "content": "<jupyter_start>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "11": {
+       "content": "<jupyter_text>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "12": {
+       "content": "<jupyter_code>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "13": {
+       "content": "<jupyter_output>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "14": {
+       "content": "<empty_output>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "15": {
+       "content": "<commit_before>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "16": {
+       "content": "<commit_msg>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "17": {
+       "content": "<commit_after>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "18": {
+       "content": "<reponame>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|endoftext|>",
+     "<fim_prefix>",
+     "<fim_middle>",
+     "<fim_suffix>",
+     "<fim_pad>",
+     "<filename>",
+     "<gh_stars>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<empty_output>",
+     "<commit_before>",
+     "<commit_msg>",
+     "<commit_after>",
+     "<reponame>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "model_max_length": 8192,
+   "pad_token": "<reponame>",
+   "padding_side": "left",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>",
+   "vocab_size": 49152
+ }
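
The tokenizer_config.json above registers the FIM and repository special tokens listed in added_tokens_decoder, reuses "<reponame>" as the padding token, pads on the left, and caps inputs at 8192 tokens. A minimal sketch of loading and checking this configuration with transformers.AutoTokenizer follows; the local checkpoint path and the example strings are assumptions for illustration, not part of the upload.

```python
# Minimal sketch: load the uploaded tokenizer files from one checkpoint
# directory and confirm the padding setup declared in tokenizer_config.json.
# "20_128_e5_3e-5/checkpoint-1324" is an assumed local copy of this upload.
from transformers import AutoTokenizer

ckpt_dir = "20_128_e5_3e-5/checkpoint-1324"  # assumed local path
tok = AutoTokenizer.from_pretrained(ckpt_dir)

print(tok.pad_token)         # expected: "<reponame>"
print(tok.padding_side)      # expected: "left"
print(tok.model_max_length)  # expected: 8192

# Left padding keeps the prompt ends aligned, the usual choice when
# batching inputs for a decoder-only model.
batch = tok(["def add(a, b):", "print('hi')"], padding=True)
for ids in batch["input_ids"]:
    print(tok.convert_ids_to_tokens(ids))
```
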
20_128_e5_3e-5/checkpoint-1655/trainer_state.json ADDED
@@ -0,0 +1,2351 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 5.0,
6
+ "eval_steps": 500,
7
+ "global_step": 1655,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.015128593040847202,
14
+ "grad_norm": 1.1886738538742065,
15
+ "learning_rate": 1.4457831325301207e-06,
16
+ "loss": 1.2563,
17
+ "step": 5
18
+ },
19
+ {
20
+ "epoch": 0.030257186081694403,
21
+ "grad_norm": 0.9057577848434448,
22
+ "learning_rate": 3.2530120481927713e-06,
23
+ "loss": 1.3015,
24
+ "step": 10
25
+ },
26
+ {
27
+ "epoch": 0.0453857791225416,
28
+ "grad_norm": 0.6479454040527344,
29
+ "learning_rate": 5.060240963855422e-06,
30
+ "loss": 1.2638,
31
+ "step": 15
32
+ },
33
+ {
34
+ "epoch": 0.060514372163388806,
35
+ "grad_norm": 0.5733408331871033,
36
+ "learning_rate": 6.867469879518072e-06,
37
+ "loss": 1.2674,
38
+ "step": 20
39
+ },
40
+ {
41
+ "epoch": 0.07564296520423601,
42
+ "grad_norm": 0.4907199740409851,
43
+ "learning_rate": 8.674698795180722e-06,
44
+ "loss": 1.2763,
45
+ "step": 25
46
+ },
47
+ {
48
+ "epoch": 0.0907715582450832,
49
+ "grad_norm": 0.5530977845191956,
50
+ "learning_rate": 1.0481927710843374e-05,
51
+ "loss": 1.2184,
52
+ "step": 30
53
+ },
54
+ {
55
+ "epoch": 0.1059001512859304,
56
+ "grad_norm": 0.568281352519989,
57
+ "learning_rate": 1.2289156626506024e-05,
58
+ "loss": 1.2443,
59
+ "step": 35
60
+ },
61
+ {
62
+ "epoch": 0.12102874432677761,
63
+ "grad_norm": 0.7138775587081909,
64
+ "learning_rate": 1.4096385542168676e-05,
65
+ "loss": 1.2258,
66
+ "step": 40
67
+ },
68
+ {
69
+ "epoch": 0.1361573373676248,
70
+ "grad_norm": 0.4135199189186096,
71
+ "learning_rate": 1.5903614457831326e-05,
72
+ "loss": 1.1615,
73
+ "step": 45
74
+ },
75
+ {
76
+ "epoch": 0.15128593040847202,
77
+ "grad_norm": 0.4403356909751892,
78
+ "learning_rate": 1.7710843373493978e-05,
79
+ "loss": 1.2148,
80
+ "step": 50
81
+ },
82
+ {
83
+ "epoch": 0.1664145234493192,
84
+ "grad_norm": 0.4433908462524414,
85
+ "learning_rate": 1.9518072289156627e-05,
86
+ "loss": 1.1543,
87
+ "step": 55
88
+ },
89
+ {
90
+ "epoch": 0.1815431164901664,
91
+ "grad_norm": 0.4714948832988739,
92
+ "learning_rate": 2.1325301204819275e-05,
93
+ "loss": 1.176,
94
+ "step": 60
95
+ },
96
+ {
97
+ "epoch": 0.19667170953101362,
98
+ "grad_norm": 0.47108614444732666,
99
+ "learning_rate": 2.313253012048193e-05,
100
+ "loss": 1.2043,
101
+ "step": 65
102
+ },
103
+ {
104
+ "epoch": 0.2118003025718608,
105
+ "grad_norm": 0.40567493438720703,
106
+ "learning_rate": 2.493975903614458e-05,
107
+ "loss": 1.1462,
108
+ "step": 70
109
+ },
110
+ {
111
+ "epoch": 0.22692889561270801,
112
+ "grad_norm": 0.4979853928089142,
113
+ "learning_rate": 2.674698795180723e-05,
114
+ "loss": 1.1362,
115
+ "step": 75
116
+ },
117
+ {
118
+ "epoch": 0.24205748865355523,
119
+ "grad_norm": 0.4744911193847656,
120
+ "learning_rate": 2.855421686746988e-05,
121
+ "loss": 1.1297,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.25718608169440244,
126
+ "grad_norm": 0.48895782232284546,
127
+ "learning_rate": 2.9999970045934107e-05,
128
+ "loss": 1.0646,
129
+ "step": 85
130
+ },
131
+ {
132
+ "epoch": 0.2723146747352496,
133
+ "grad_norm": 0.5125440955162048,
134
+ "learning_rate": 2.999892166618921e-05,
135
+ "loss": 1.0775,
136
+ "step": 90
137
+ },
138
+ {
139
+ "epoch": 0.2874432677760968,
140
+ "grad_norm": 0.5728165507316589,
141
+ "learning_rate": 2.999637570278028e-05,
142
+ "loss": 1.1223,
143
+ "step": 95
144
+ },
145
+ {
146
+ "epoch": 0.30257186081694404,
147
+ "grad_norm": 0.5014736652374268,
148
+ "learning_rate": 2.999233240991181e-05,
149
+ "loss": 1.0431,
150
+ "step": 100
151
+ },
152
+ {
153
+ "epoch": 0.3177004538577912,
154
+ "grad_norm": 0.4918844699859619,
155
+ "learning_rate": 2.9986792191290767e-05,
156
+ "loss": 1.0938,
157
+ "step": 105
158
+ },
159
+ {
160
+ "epoch": 0.3328290468986384,
161
+ "grad_norm": 0.5097721219062805,
162
+ "learning_rate": 2.9979755600086323e-05,
163
+ "loss": 1.0524,
164
+ "step": 110
165
+ },
166
+ {
167
+ "epoch": 0.34795763993948564,
168
+ "grad_norm": 0.6224564909934998,
169
+ "learning_rate": 2.9971223338874577e-05,
170
+ "loss": 0.975,
171
+ "step": 115
172
+ },
173
+ {
174
+ "epoch": 0.3630862329803328,
175
+ "grad_norm": 0.6176092028617859,
176
+ "learning_rate": 2.996119625956845e-05,
177
+ "loss": 1.0153,
178
+ "step": 120
179
+ },
180
+ {
181
+ "epoch": 0.37821482602118,
182
+ "grad_norm": 0.5799441933631897,
183
+ "learning_rate": 2.994967536333258e-05,
184
+ "loss": 0.9895,
185
+ "step": 125
186
+ },
187
+ {
188
+ "epoch": 0.39334341906202724,
189
+ "grad_norm": 0.5250234007835388,
190
+ "learning_rate": 2.993666180048341e-05,
191
+ "loss": 1.0104,
192
+ "step": 130
193
+ },
194
+ {
195
+ "epoch": 0.4084720121028744,
196
+ "grad_norm": 0.6541566848754883,
197
+ "learning_rate": 2.992215687037428e-05,
198
+ "loss": 1.0159,
199
+ "step": 135
200
+ },
201
+ {
202
+ "epoch": 0.4236006051437216,
203
+ "grad_norm": 0.6040735244750977,
204
+ "learning_rate": 2.9906162021265736e-05,
205
+ "loss": 1.0339,
206
+ "step": 140
207
+ },
208
+ {
209
+ "epoch": 0.43872919818456885,
210
+ "grad_norm": 0.846281111240387,
211
+ "learning_rate": 2.9888678850180895e-05,
212
+ "loss": 0.9737,
213
+ "step": 145
214
+ },
215
+ {
216
+ "epoch": 0.45385779122541603,
217
+ "grad_norm": 0.6770375967025757,
218
+ "learning_rate": 2.986970910274601e-05,
219
+ "loss": 0.9342,
220
+ "step": 150
221
+ },
222
+ {
223
+ "epoch": 0.4689863842662632,
224
+ "grad_norm": 0.7691357135772705,
225
+ "learning_rate": 2.9849254673016178e-05,
226
+ "loss": 0.965,
227
+ "step": 155
228
+ },
229
+ {
230
+ "epoch": 0.48411497730711045,
231
+ "grad_norm": 0.608444094657898,
232
+ "learning_rate": 2.982731760328619e-05,
233
+ "loss": 0.9331,
234
+ "step": 160
235
+ },
236
+ {
237
+ "epoch": 0.49924357034795763,
238
+ "grad_norm": 0.7073132991790771,
239
+ "learning_rate": 2.980390008388667e-05,
240
+ "loss": 0.9161,
241
+ "step": 165
242
+ },
243
+ {
244
+ "epoch": 0.5143721633888049,
245
+ "grad_norm": 0.6664533615112305,
246
+ "learning_rate": 2.977900445296533e-05,
247
+ "loss": 0.9635,
248
+ "step": 170
249
+ },
250
+ {
251
+ "epoch": 0.529500756429652,
252
+ "grad_norm": 0.7164839506149292,
253
+ "learning_rate": 2.975263319625355e-05,
254
+ "loss": 0.9521,
255
+ "step": 175
256
+ },
257
+ {
258
+ "epoch": 0.5446293494704992,
259
+ "grad_norm": 0.7335389256477356,
260
+ "learning_rate": 2.972478894681817e-05,
261
+ "loss": 0.9104,
262
+ "step": 180
263
+ },
264
+ {
265
+ "epoch": 0.5597579425113465,
266
+ "grad_norm": 0.7356768250465393,
267
+ "learning_rate": 2.9695474484798582e-05,
268
+ "loss": 0.8796,
269
+ "step": 185
270
+ },
271
+ {
272
+ "epoch": 0.5748865355521936,
273
+ "grad_norm": 0.7879706621170044,
274
+ "learning_rate": 2.966469273712917e-05,
275
+ "loss": 0.9024,
276
+ "step": 190
277
+ },
278
+ {
279
+ "epoch": 0.5900151285930408,
280
+ "grad_norm": 0.8149765133857727,
281
+ "learning_rate": 2.9632446777247045e-05,
282
+ "loss": 0.8684,
283
+ "step": 195
284
+ },
285
+ {
286
+ "epoch": 0.6051437216338881,
287
+ "grad_norm": 0.718940794467926,
288
+ "learning_rate": 2.959873982478517e-05,
289
+ "loss": 0.9132,
290
+ "step": 200
291
+ },
292
+ {
293
+ "epoch": 0.6202723146747352,
294
+ "grad_norm": 0.7738745212554932,
295
+ "learning_rate": 2.956357524525093e-05,
296
+ "loss": 0.8179,
297
+ "step": 205
298
+ },
299
+ {
300
+ "epoch": 0.6354009077155824,
301
+ "grad_norm": 0.7994048595428467,
302
+ "learning_rate": 2.9526956549690037e-05,
303
+ "loss": 0.8375,
304
+ "step": 210
305
+ },
306
+ {
307
+ "epoch": 0.6505295007564297,
308
+ "grad_norm": 0.7944709062576294,
309
+ "learning_rate": 2.9488887394336025e-05,
310
+ "loss": 0.8244,
311
+ "step": 215
312
+ },
313
+ {
314
+ "epoch": 0.6656580937972768,
315
+ "grad_norm": 0.8380885124206543,
316
+ "learning_rate": 2.9449371580245162e-05,
317
+ "loss": 0.7798,
318
+ "step": 220
319
+ },
320
+ {
321
+ "epoch": 0.680786686838124,
322
+ "grad_norm": 0.8535066246986389,
323
+ "learning_rate": 2.9408413052916923e-05,
324
+ "loss": 0.8358,
325
+ "step": 225
326
+ },
327
+ {
328
+ "epoch": 0.6959152798789713,
329
+ "grad_norm": 0.8346946835517883,
330
+ "learning_rate": 2.9366015901900066e-05,
331
+ "loss": 0.8022,
332
+ "step": 230
333
+ },
334
+ {
335
+ "epoch": 0.7110438729198184,
336
+ "grad_norm": 0.840814471244812,
337
+ "learning_rate": 2.9322184360384297e-05,
338
+ "loss": 0.8327,
339
+ "step": 235
340
+ },
341
+ {
342
+ "epoch": 0.7261724659606656,
343
+ "grad_norm": 0.9543588757514954,
344
+ "learning_rate": 2.92769228047776e-05,
345
+ "loss": 0.7916,
346
+ "step": 240
347
+ },
348
+ {
349
+ "epoch": 0.7413010590015129,
350
+ "grad_norm": 0.9211695194244385,
351
+ "learning_rate": 2.923023575426927e-05,
352
+ "loss": 0.8282,
353
+ "step": 245
354
+ },
355
+ {
356
+ "epoch": 0.75642965204236,
357
+ "grad_norm": 0.9559474587440491,
358
+ "learning_rate": 2.91821278703787e-05,
359
+ "loss": 0.7517,
360
+ "step": 250
361
+ },
362
+ {
363
+ "epoch": 0.7715582450832073,
364
+ "grad_norm": 0.9239177107810974,
365
+ "learning_rate": 2.9132603956489934e-05,
366
+ "loss": 0.7995,
367
+ "step": 255
368
+ },
369
+ {
370
+ "epoch": 0.7866868381240545,
371
+ "grad_norm": 0.8963080048561096,
372
+ "learning_rate": 2.9081668957372072e-05,
373
+ "loss": 0.7688,
374
+ "step": 260
375
+ },
376
+ {
377
+ "epoch": 0.8018154311649016,
378
+ "grad_norm": 0.9856358766555786,
379
+ "learning_rate": 2.902932795868556e-05,
380
+ "loss": 0.8155,
381
+ "step": 265
382
+ },
383
+ {
384
+ "epoch": 0.8169440242057489,
385
+ "grad_norm": 0.8115594387054443,
386
+ "learning_rate": 2.8975586186474397e-05,
387
+ "loss": 0.7465,
388
+ "step": 270
389
+ },
390
+ {
391
+ "epoch": 0.8320726172465961,
392
+ "grad_norm": 0.8640703558921814,
393
+ "learning_rate": 2.8920449006644346e-05,
394
+ "loss": 0.7582,
395
+ "step": 275
396
+ },
397
+ {
398
+ "epoch": 0.8472012102874432,
399
+ "grad_norm": 0.8925158977508545,
400
+ "learning_rate": 2.886392192442715e-05,
401
+ "loss": 0.7413,
402
+ "step": 280
403
+ },
404
+ {
405
+ "epoch": 0.8623298033282905,
406
+ "grad_norm": 0.9191139340400696,
407
+ "learning_rate": 2.8806010583830885e-05,
408
+ "loss": 0.7536,
409
+ "step": 285
410
+ },
411
+ {
412
+ "epoch": 0.8774583963691377,
413
+ "grad_norm": 0.9398569464683533,
414
+ "learning_rate": 2.8746720767076403e-05,
415
+ "loss": 0.6988,
416
+ "step": 290
417
+ },
418
+ {
419
+ "epoch": 0.8925869894099848,
420
+ "grad_norm": 1.0098797082901,
421
+ "learning_rate": 2.8686058394020006e-05,
422
+ "loss": 0.7048,
423
+ "step": 295
424
+ },
425
+ {
426
+ "epoch": 0.9077155824508321,
427
+ "grad_norm": 0.9016486406326294,
428
+ "learning_rate": 2.862402952156238e-05,
429
+ "loss": 0.6867,
430
+ "step": 300
431
+ },
432
+ {
433
+ "epoch": 0.9228441754916793,
434
+ "grad_norm": 0.9746642708778381,
435
+ "learning_rate": 2.856064034304384e-05,
436
+ "loss": 0.6887,
437
+ "step": 305
438
+ },
439
+ {
440
+ "epoch": 0.9379727685325264,
441
+ "grad_norm": 0.9909067749977112,
442
+ "learning_rate": 2.849589718762592e-05,
443
+ "loss": 0.7016,
444
+ "step": 310
445
+ },
446
+ {
447
+ "epoch": 0.9531013615733737,
448
+ "grad_norm": 1.0518617630004883,
449
+ "learning_rate": 2.8429806519659463e-05,
450
+ "loss": 0.7014,
451
+ "step": 315
452
+ },
453
+ {
454
+ "epoch": 0.9682299546142209,
455
+ "grad_norm": 1.1244536638259888,
456
+ "learning_rate": 2.8362374938039174e-05,
457
+ "loss": 0.6708,
458
+ "step": 320
459
+ },
460
+ {
461
+ "epoch": 0.983358547655068,
462
+ "grad_norm": 1.0927412509918213,
463
+ "learning_rate": 2.8293609175544737e-05,
464
+ "loss": 0.6681,
465
+ "step": 325
466
+ },
467
+ {
468
+ "epoch": 0.9984871406959153,
469
+ "grad_norm": 0.940122127532959,
470
+ "learning_rate": 2.8223516098168573e-05,
471
+ "loss": 0.6809,
472
+ "step": 330
473
+ },
474
+ {
475
+ "epoch": 1.0121028744326777,
476
+ "grad_norm": 1.1608147621154785,
477
+ "learning_rate": 2.8152102704430312e-05,
478
+ "loss": 0.6216,
479
+ "step": 335
480
+ },
481
+ {
482
+ "epoch": 1.027231467473525,
483
+ "grad_norm": 1.0822938680648804,
484
+ "learning_rate": 2.8079376124678e-05,
485
+ "loss": 0.602,
486
+ "step": 340
487
+ },
488
+ {
489
+ "epoch": 1.0423600605143721,
490
+ "grad_norm": 1.149957537651062,
491
+ "learning_rate": 2.800534362037618e-05,
492
+ "loss": 0.5483,
493
+ "step": 345
494
+ },
495
+ {
496
+ "epoch": 1.0574886535552193,
497
+ "grad_norm": 1.0549287796020508,
498
+ "learning_rate": 2.793001258338084e-05,
499
+ "loss": 0.5477,
500
+ "step": 350
501
+ },
502
+ {
503
+ "epoch": 1.0726172465960666,
504
+ "grad_norm": 1.0644642114639282,
505
+ "learning_rate": 2.7853390535201396e-05,
506
+ "loss": 0.5725,
507
+ "step": 355
508
+ },
509
+ {
510
+ "epoch": 1.0877458396369137,
511
+ "grad_norm": 1.1603634357452393,
512
+ "learning_rate": 2.7775485126249665e-05,
513
+ "loss": 0.5744,
514
+ "step": 360
515
+ },
516
+ {
517
+ "epoch": 1.102874432677761,
518
+ "grad_norm": 1.0989371538162231,
519
+ "learning_rate": 2.7696304135076024e-05,
520
+ "loss": 0.5413,
521
+ "step": 365
522
+ },
523
+ {
524
+ "epoch": 1.1180030257186082,
525
+ "grad_norm": 1.0977047681808472,
526
+ "learning_rate": 2.7615855467592756e-05,
527
+ "loss": 0.6023,
528
+ "step": 370
529
+ },
530
+ {
531
+ "epoch": 1.1331316187594553,
532
+ "grad_norm": 1.0698915719985962,
533
+ "learning_rate": 2.753414715628464e-05,
534
+ "loss": 0.5721,
535
+ "step": 375
536
+ },
537
+ {
538
+ "epoch": 1.1482602118003027,
539
+ "grad_norm": 1.1103595495224,
540
+ "learning_rate": 2.745118735940699e-05,
541
+ "loss": 0.5445,
542
+ "step": 380
543
+ },
544
+ {
545
+ "epoch": 1.1633888048411498,
546
+ "grad_norm": 1.0626442432403564,
547
+ "learning_rate": 2.7366984360171047e-05,
548
+ "loss": 0.5558,
549
+ "step": 385
550
+ },
551
+ {
552
+ "epoch": 1.178517397881997,
553
+ "grad_norm": 1.076282024383545,
554
+ "learning_rate": 2.7281546565916948e-05,
555
+ "loss": 0.487,
556
+ "step": 390
557
+ },
558
+ {
559
+ "epoch": 1.1936459909228443,
560
+ "grad_norm": 1.100417971611023,
561
+ "learning_rate": 2.719488250727427e-05,
562
+ "loss": 0.5346,
563
+ "step": 395
564
+ },
565
+ {
566
+ "epoch": 1.2087745839636914,
567
+ "grad_norm": 1.079980731010437,
568
+ "learning_rate": 2.710700083731032e-05,
569
+ "loss": 0.507,
570
+ "step": 400
571
+ },
572
+ {
573
+ "epoch": 1.2239031770045385,
574
+ "grad_norm": 1.086501955986023,
575
+ "learning_rate": 2.701791033066612e-05,
576
+ "loss": 0.5573,
577
+ "step": 405
578
+ },
579
+ {
580
+ "epoch": 1.239031770045386,
581
+ "grad_norm": 1.3494257926940918,
582
+ "learning_rate": 2.6927619882680286e-05,
583
+ "loss": 0.5116,
584
+ "step": 410
585
+ },
586
+ {
587
+ "epoch": 1.254160363086233,
588
+ "grad_norm": 1.0927280187606812,
589
+ "learning_rate": 2.6836138508500918e-05,
590
+ "loss": 0.5368,
591
+ "step": 415
592
+ },
593
+ {
594
+ "epoch": 1.2692889561270801,
595
+ "grad_norm": 1.0733301639556885,
596
+ "learning_rate": 2.6743475342185414e-05,
597
+ "loss": 0.5352,
598
+ "step": 420
599
+ },
600
+ {
601
+ "epoch": 1.2844175491679275,
602
+ "grad_norm": 1.0263490676879883,
603
+ "learning_rate": 2.664963963578851e-05,
604
+ "loss": 0.5274,
605
+ "step": 425
606
+ },
607
+ {
608
+ "epoch": 1.2995461422087746,
609
+ "grad_norm": 1.1017117500305176,
610
+ "learning_rate": 2.655464075843847e-05,
611
+ "loss": 0.5247,
612
+ "step": 430
613
+ },
614
+ {
615
+ "epoch": 1.3146747352496218,
616
+ "grad_norm": 1.308171033859253,
617
+ "learning_rate": 2.6458488195401636e-05,
618
+ "loss": 0.534,
619
+ "step": 435
620
+ },
621
+ {
622
+ "epoch": 1.329803328290469,
623
+ "grad_norm": 1.0724010467529297,
624
+ "learning_rate": 2.6361191547135355e-05,
625
+ "loss": 0.4958,
626
+ "step": 440
627
+ },
628
+ {
629
+ "epoch": 1.3449319213313162,
630
+ "grad_norm": 1.0588688850402832,
631
+ "learning_rate": 2.62627605283294e-05,
632
+ "loss": 0.4838,
633
+ "step": 445
634
+ },
635
+ {
636
+ "epoch": 1.3600605143721634,
637
+ "grad_norm": 1.0089439153671265,
638
+ "learning_rate": 2.6163204966936022e-05,
639
+ "loss": 0.5041,
640
+ "step": 450
641
+ },
642
+ {
643
+ "epoch": 1.3751891074130107,
644
+ "grad_norm": 1.1149275302886963,
645
+ "learning_rate": 2.6062534803188628e-05,
646
+ "loss": 0.4661,
647
+ "step": 455
648
+ },
649
+ {
650
+ "epoch": 1.3903177004538578,
651
+ "grad_norm": 1.2650748491287231,
652
+ "learning_rate": 2.596076008860933e-05,
653
+ "loss": 0.4994,
654
+ "step": 460
655
+ },
656
+ {
657
+ "epoch": 1.405446293494705,
658
+ "grad_norm": 1.2204926013946533,
659
+ "learning_rate": 2.5857890985005315e-05,
660
+ "loss": 0.4399,
661
+ "step": 465
662
+ },
663
+ {
664
+ "epoch": 1.4205748865355523,
665
+ "grad_norm": 1.078391194343567,
666
+ "learning_rate": 2.5753937763454233e-05,
667
+ "loss": 0.5202,
668
+ "step": 470
669
+ },
670
+ {
671
+ "epoch": 1.4357034795763994,
672
+ "grad_norm": 1.0975085496902466,
673
+ "learning_rate": 2.5648910803278662e-05,
674
+ "loss": 0.5036,
675
+ "step": 475
676
+ },
677
+ {
678
+ "epoch": 1.4508320726172466,
679
+ "grad_norm": 1.299043893814087,
680
+ "learning_rate": 2.55428205910098e-05,
681
+ "loss": 0.495,
682
+ "step": 480
683
+ },
684
+ {
685
+ "epoch": 1.465960665658094,
686
+ "grad_norm": 1.2129578590393066,
687
+ "learning_rate": 2.543567771934039e-05,
688
+ "loss": 0.5012,
689
+ "step": 485
690
+ },
691
+ {
692
+ "epoch": 1.481089258698941,
693
+ "grad_norm": 1.2743144035339355,
694
+ "learning_rate": 2.5327492886067115e-05,
695
+ "loss": 0.5098,
696
+ "step": 490
697
+ },
698
+ {
699
+ "epoch": 1.4962178517397882,
700
+ "grad_norm": 1.2529656887054443,
701
+ "learning_rate": 2.5218276893022435e-05,
702
+ "loss": 0.4847,
703
+ "step": 495
704
+ },
705
+ {
706
+ "epoch": 1.5113464447806355,
707
+ "grad_norm": 1.0965994596481323,
708
+ "learning_rate": 2.5108040644996087e-05,
709
+ "loss": 0.4372,
710
+ "step": 500
711
+ },
712
+ {
713
+ "epoch": 1.5264750378214826,
714
+ "grad_norm": 1.1192255020141602,
715
+ "learning_rate": 2.4996795148646283e-05,
716
+ "loss": 0.4632,
717
+ "step": 505
718
+ },
719
+ {
720
+ "epoch": 1.5416036308623298,
721
+ "grad_norm": 1.0835782289505005,
722
+ "learning_rate": 2.4884551511400714e-05,
723
+ "loss": 0.4551,
724
+ "step": 510
725
+ },
726
+ {
727
+ "epoch": 1.5567322239031771,
728
+ "grad_norm": 1.0586811304092407,
729
+ "learning_rate": 2.4771320940347554e-05,
730
+ "loss": 0.439,
731
+ "step": 515
732
+ },
733
+ {
734
+ "epoch": 1.5718608169440242,
735
+ "grad_norm": 1.0878212451934814,
736
+ "learning_rate": 2.4657114741116458e-05,
737
+ "loss": 0.4175,
738
+ "step": 520
739
+ },
740
+ {
741
+ "epoch": 1.5869894099848714,
742
+ "grad_norm": 1.1067463159561157,
743
+ "learning_rate": 2.454194431674972e-05,
744
+ "loss": 0.4371,
745
+ "step": 525
746
+ },
747
+ {
748
+ "epoch": 1.6021180030257187,
749
+ "grad_norm": 1.3756803274154663,
750
+ "learning_rate": 2.4425821166563757e-05,
751
+ "loss": 0.4379,
752
+ "step": 530
753
+ },
754
+ {
755
+ "epoch": 1.6172465960665658,
756
+ "grad_norm": 1.1154321432113647,
757
+ "learning_rate": 2.4308756885000928e-05,
758
+ "loss": 0.4382,
759
+ "step": 535
760
+ },
761
+ {
762
+ "epoch": 1.632375189107413,
763
+ "grad_norm": 1.2593263387680054,
764
+ "learning_rate": 2.419076316047189e-05,
765
+ "loss": 0.4171,
766
+ "step": 540
767
+ },
768
+ {
769
+ "epoch": 1.6475037821482603,
770
+ "grad_norm": 1.035703420639038,
771
+ "learning_rate": 2.407185177418853e-05,
772
+ "loss": 0.4212,
773
+ "step": 545
774
+ },
775
+ {
776
+ "epoch": 1.6626323751891074,
777
+ "grad_norm": 1.7641640901565552,
778
+ "learning_rate": 2.3952034598987677e-05,
779
+ "loss": 0.4193,
780
+ "step": 550
781
+ },
782
+ {
783
+ "epoch": 1.6777609682299546,
784
+ "grad_norm": 1.1478875875473022,
785
+ "learning_rate": 2.3831323598145644e-05,
786
+ "loss": 0.4529,
787
+ "step": 555
788
+ },
789
+ {
790
+ "epoch": 1.692889561270802,
791
+ "grad_norm": 1.028493046760559,
792
+ "learning_rate": 2.370973082418374e-05,
793
+ "loss": 0.4165,
794
+ "step": 560
795
+ },
796
+ {
797
+ "epoch": 1.708018154311649,
798
+ "grad_norm": 1.1380536556243896,
799
+ "learning_rate": 2.3587268417664848e-05,
800
+ "loss": 0.4264,
801
+ "step": 565
802
+ },
803
+ {
804
+ "epoch": 1.7231467473524962,
805
+ "grad_norm": 1.0838291645050049,
806
+ "learning_rate": 2.34639486059813e-05,
807
+ "loss": 0.4264,
808
+ "step": 570
809
+ },
810
+ {
811
+ "epoch": 1.7382753403933435,
812
+ "grad_norm": 1.1299512386322021,
813
+ "learning_rate": 2.3339783702133955e-05,
814
+ "loss": 0.4051,
815
+ "step": 575
816
+ },
817
+ {
818
+ "epoch": 1.7534039334341907,
819
+ "grad_norm": 1.0154179334640503,
820
+ "learning_rate": 2.321478610350282e-05,
821
+ "loss": 0.3535,
822
+ "step": 580
823
+ },
824
+ {
825
+ "epoch": 1.7685325264750378,
826
+ "grad_norm": 1.2361255884170532,
827
+ "learning_rate": 2.3088968290609223e-05,
828
+ "loss": 0.3816,
829
+ "step": 585
830
+ },
831
+ {
832
+ "epoch": 1.7836611195158851,
833
+ "grad_norm": 1.1661337614059448,
834
+ "learning_rate": 2.2962342825869684e-05,
835
+ "loss": 0.3797,
836
+ "step": 590
837
+ },
838
+ {
839
+ "epoch": 1.7987897125567323,
840
+ "grad_norm": 1.0582844018936157,
841
+ "learning_rate": 2.2834922352341587e-05,
842
+ "loss": 0.3807,
843
+ "step": 595
844
+ },
845
+ {
846
+ "epoch": 1.8139183055975794,
847
+ "grad_norm": 1.287212610244751,
848
+ "learning_rate": 2.2706719592460843e-05,
849
+ "loss": 0.4034,
850
+ "step": 600
851
+ },
852
+ {
853
+ "epoch": 1.8290468986384267,
854
+ "grad_norm": 1.263323187828064,
855
+ "learning_rate": 2.25777473467716e-05,
856
+ "loss": 0.4052,
857
+ "step": 605
858
+ },
859
+ {
860
+ "epoch": 1.8441754916792739,
861
+ "grad_norm": 0.9774038195610046,
862
+ "learning_rate": 2.2448018492648147e-05,
863
+ "loss": 0.3838,
864
+ "step": 610
865
+ },
866
+ {
867
+ "epoch": 1.859304084720121,
868
+ "grad_norm": 1.2147777080535889,
869
+ "learning_rate": 2.2317545983009166e-05,
870
+ "loss": 0.4093,
871
+ "step": 615
872
+ },
873
+ {
874
+ "epoch": 1.8744326777609683,
875
+ "grad_norm": 1.171796202659607,
876
+ "learning_rate": 2.218634284502444e-05,
877
+ "loss": 0.3878,
878
+ "step": 620
879
+ },
880
+ {
881
+ "epoch": 1.8895612708018155,
882
+ "grad_norm": 1.2053415775299072,
883
+ "learning_rate": 2.205442217881412e-05,
884
+ "loss": 0.3768,
885
+ "step": 625
886
+ },
887
+ {
888
+ "epoch": 1.9046898638426626,
889
+ "grad_norm": 1.192639946937561,
890
+ "learning_rate": 2.192179715614077e-05,
891
+ "loss": 0.398,
892
+ "step": 630
893
+ },
894
+ {
895
+ "epoch": 1.91981845688351,
896
+ "grad_norm": 1.2259142398834229,
897
+ "learning_rate": 2.1788481019094164e-05,
898
+ "loss": 0.3821,
899
+ "step": 635
900
+ },
901
+ {
902
+ "epoch": 1.934947049924357,
903
+ "grad_norm": 1.2637989521026611,
904
+ "learning_rate": 2.165448707876916e-05,
905
+ "loss": 0.3581,
906
+ "step": 640
907
+ },
908
+ {
909
+ "epoch": 1.9500756429652042,
910
+ "grad_norm": 1.0719422101974487,
911
+ "learning_rate": 2.1519828713936614e-05,
912
+ "loss": 0.3681,
913
+ "step": 645
914
+ },
915
+ {
916
+ "epoch": 1.9652042360060515,
917
+ "grad_norm": 1.1670970916748047,
918
+ "learning_rate": 2.138451936970757e-05,
919
+ "loss": 0.3546,
920
+ "step": 650
921
+ },
922
+ {
923
+ "epoch": 1.9803328290468987,
924
+ "grad_norm": 1.2138304710388184,
925
+ "learning_rate": 2.1248572556190837e-05,
926
+ "loss": 0.3797,
927
+ "step": 655
928
+ },
929
+ {
930
+ "epoch": 1.9954614220877458,
931
+ "grad_norm": 1.2247320413589478,
932
+ "learning_rate": 2.1112001847144013e-05,
933
+ "loss": 0.299,
934
+ "step": 660
935
+ },
936
+ {
937
+ "epoch": 2.0090771558245084,
938
+ "grad_norm": 1.101378083229065,
939
+ "learning_rate": 2.097482087861824e-05,
940
+ "loss": 0.3227,
941
+ "step": 665
942
+ },
943
+ {
944
+ "epoch": 2.0242057488653553,
945
+ "grad_norm": 1.1532782316207886,
946
+ "learning_rate": 2.0837043347596675e-05,
947
+ "loss": 0.255,
948
+ "step": 670
949
+ },
950
+ {
951
+ "epoch": 2.0393343419062027,
952
+ "grad_norm": 1.2093374729156494,
953
+ "learning_rate": 2.069868301062691e-05,
954
+ "loss": 0.2942,
955
+ "step": 675
956
+ },
957
+ {
958
+ "epoch": 2.05446293494705,
959
+ "grad_norm": 1.0964783430099487,
960
+ "learning_rate": 2.0559753682447436e-05,
961
+ "loss": 0.2824,
962
+ "step": 680
963
+ },
964
+ {
965
+ "epoch": 2.069591527987897,
966
+ "grad_norm": 1.0570346117019653,
967
+ "learning_rate": 2.0420269234608282e-05,
968
+ "loss": 0.2542,
969
+ "step": 685
970
+ },
971
+ {
972
+ "epoch": 2.0847201210287443,
973
+ "grad_norm": 1.1446951627731323,
974
+ "learning_rate": 2.0280243594086013e-05,
975
+ "loss": 0.2747,
976
+ "step": 690
977
+ },
978
+ {
979
+ "epoch": 2.0998487140695916,
980
+ "grad_norm": 1.475313425064087,
981
+ "learning_rate": 2.0139690741893152e-05,
982
+ "loss": 0.2992,
983
+ "step": 695
984
+ },
985
+ {
986
+ "epoch": 2.1149773071104385,
987
+ "grad_norm": 1.0060259103775024,
988
+ "learning_rate": 1.999862471168226e-05,
989
+ "loss": 0.2619,
990
+ "step": 700
991
+ },
992
+ {
993
+ "epoch": 2.130105900151286,
994
+ "grad_norm": 1.215386986732483,
995
+ "learning_rate": 1.985705958834471e-05,
996
+ "loss": 0.2676,
997
+ "step": 705
998
+ },
999
+ {
1000
+ "epoch": 2.145234493192133,
1001
+ "grad_norm": 1.0537375211715698,
1002
+ "learning_rate": 1.9715009506604383e-05,
1003
+ "loss": 0.2947,
1004
+ "step": 710
1005
+ },
1006
+ {
1007
+ "epoch": 2.16036308623298,
1008
+ "grad_norm": 1.187626838684082,
1009
+ "learning_rate": 1.9572488649606335e-05,
1010
+ "loss": 0.2658,
1011
+ "step": 715
1012
+ },
1013
+ {
1014
+ "epoch": 2.1754916792738275,
1015
+ "grad_norm": 1.119370698928833,
1016
+ "learning_rate": 1.942951124750071e-05,
1017
+ "loss": 0.2679,
1018
+ "step": 720
1019
+ },
1020
+ {
1021
+ "epoch": 2.190620272314675,
1022
+ "grad_norm": 1.030710220336914,
1023
+ "learning_rate": 1.928609157602189e-05,
1024
+ "loss": 0.2525,
1025
+ "step": 725
1026
+ },
1027
+ {
1028
+ "epoch": 2.205748865355522,
1029
+ "grad_norm": 1.0458844900131226,
1030
+ "learning_rate": 1.9142243955063125e-05,
1031
+ "loss": 0.2608,
1032
+ "step": 730
1033
+ },
1034
+ {
1035
+ "epoch": 2.220877458396369,
1036
+ "grad_norm": 1.1421456336975098,
1037
+ "learning_rate": 1.8997982747246744e-05,
1038
+ "loss": 0.2771,
1039
+ "step": 735
1040
+ },
1041
+ {
1042
+ "epoch": 2.2360060514372164,
1043
+ "grad_norm": 1.0802360773086548,
1044
+ "learning_rate": 1.8853322356490104e-05,
1045
+ "loss": 0.2593,
1046
+ "step": 740
1047
+ },
1048
+ {
1049
+ "epoch": 2.2511346444780633,
1050
+ "grad_norm": 1.1976512670516968,
1051
+ "learning_rate": 1.870827722656743e-05,
1052
+ "loss": 0.2663,
1053
+ "step": 745
1054
+ },
1055
+ {
1056
+ "epoch": 2.2662632375189107,
1057
+ "grad_norm": 1.2698267698287964,
1058
+ "learning_rate": 1.8562861839667635e-05,
1059
+ "loss": 0.2543,
1060
+ "step": 750
1061
+ },
1062
+ {
1063
+ "epoch": 2.281391830559758,
1064
+ "grad_norm": 1.14236581325531,
1065
+ "learning_rate": 1.8417090714948337e-05,
1066
+ "loss": 0.2514,
1067
+ "step": 755
1068
+ },
1069
+ {
1070
+ "epoch": 2.2965204236006054,
1071
+ "grad_norm": 1.158115029335022,
1072
+ "learning_rate": 1.827097840708621e-05,
1073
+ "loss": 0.2572,
1074
+ "step": 760
1075
+ },
1076
+ {
1077
+ "epoch": 2.3116490166414523,
1078
+ "grad_norm": 1.258471965789795,
1079
+ "learning_rate": 1.8124539504823704e-05,
1080
+ "loss": 0.2505,
1081
+ "step": 765
1082
+ },
1083
+ {
1084
+ "epoch": 2.3267776096822996,
1085
+ "grad_norm": 1.1164860725402832,
1086
+ "learning_rate": 1.7977788629512457e-05,
1087
+ "loss": 0.2448,
1088
+ "step": 770
1089
+ },
1090
+ {
1091
+ "epoch": 2.3419062027231465,
1092
+ "grad_norm": 1.2983452081680298,
1093
+ "learning_rate": 1.783074043365339e-05,
1094
+ "loss": 0.2898,
1095
+ "step": 775
1096
+ },
1097
+ {
1098
+ "epoch": 2.357034795763994,
1099
+ "grad_norm": 1.1434799432754517,
1100
+ "learning_rate": 1.7683409599433716e-05,
1101
+ "loss": 0.2645,
1102
+ "step": 780
1103
+ },
1104
+ {
1105
+ "epoch": 2.3721633888048412,
1106
+ "grad_norm": 1.1078280210494995,
1107
+ "learning_rate": 1.7535810837260996e-05,
1108
+ "loss": 0.2317,
1109
+ "step": 785
1110
+ },
1111
+ {
1112
+ "epoch": 2.3872919818456886,
1113
+ "grad_norm": 1.0709105730056763,
1114
+ "learning_rate": 1.7387958884294325e-05,
1115
+ "loss": 0.2591,
1116
+ "step": 790
1117
+ },
1118
+ {
1119
+ "epoch": 2.4024205748865355,
1120
+ "grad_norm": 1.1712156534194946,
1121
+ "learning_rate": 1.723986850297293e-05,
1122
+ "loss": 0.2459,
1123
+ "step": 795
1124
+ },
1125
+ {
1126
+ "epoch": 2.417549167927383,
1127
+ "grad_norm": 1.2031867504119873,
1128
+ "learning_rate": 1.7091554479542172e-05,
1129
+ "loss": 0.2292,
1130
+ "step": 800
1131
+ },
1132
+ {
1133
+ "epoch": 2.4326777609682297,
1134
+ "grad_norm": 1.15472412109375,
1135
+ "learning_rate": 1.6943031622577197e-05,
1136
+ "loss": 0.2307,
1137
+ "step": 805
1138
+ },
1139
+ {
1140
+ "epoch": 2.447806354009077,
1141
+ "grad_norm": 1.0374424457550049,
1142
+ "learning_rate": 1.6794314761504362e-05,
1143
+ "loss": 0.2286,
1144
+ "step": 810
1145
+ },
1146
+ {
1147
+ "epoch": 2.4629349470499244,
1148
+ "grad_norm": 1.168131947517395,
1149
+ "learning_rate": 1.6645418745120583e-05,
1150
+ "loss": 0.2454,
1151
+ "step": 815
1152
+ },
1153
+ {
1154
+ "epoch": 2.478063540090772,
1155
+ "grad_norm": 1.1437524557113647,
1156
+ "learning_rate": 1.6496358440110725e-05,
1157
+ "loss": 0.2301,
1158
+ "step": 820
1159
+ },
1160
+ {
1161
+ "epoch": 2.4931921331316187,
1162
+ "grad_norm": 1.0511492490768433,
1163
+ "learning_rate": 1.6347148729563236e-05,
1164
+ "loss": 0.2301,
1165
+ "step": 825
1166
+ },
1167
+ {
1168
+ "epoch": 2.508320726172466,
1169
+ "grad_norm": 1.3130929470062256,
1170
+ "learning_rate": 1.6197804511484115e-05,
1171
+ "loss": 0.2288,
1172
+ "step": 830
1173
+ },
1174
+ {
1175
+ "epoch": 2.523449319213313,
1176
+ "grad_norm": 1.0749022960662842,
1177
+ "learning_rate": 1.604834069730942e-05,
1178
+ "loss": 0.2339,
1179
+ "step": 835
1180
+ },
1181
+ {
1182
+ "epoch": 2.5385779122541603,
1183
+ "grad_norm": 1.1725904941558838,
1184
+ "learning_rate": 1.589877221041641e-05,
1185
+ "loss": 0.2486,
1186
+ "step": 840
1187
+ },
1188
+ {
1189
+ "epoch": 2.5537065052950076,
1190
+ "grad_norm": 1.171730875968933,
1191
+ "learning_rate": 1.5749113984633504e-05,
1192
+ "loss": 0.2254,
1193
+ "step": 845
1194
+ },
1195
+ {
1196
+ "epoch": 2.568835098335855,
1197
+ "grad_norm": 1.1446218490600586,
1198
+ "learning_rate": 1.5599380962749188e-05,
1199
+ "loss": 0.2028,
1200
+ "step": 850
1201
+ },
1202
+ {
1203
+ "epoch": 2.583963691376702,
1204
+ "grad_norm": 1.14954674243927,
1205
+ "learning_rate": 1.5449588095020064e-05,
1206
+ "loss": 0.2465,
1207
+ "step": 855
1208
+ },
1209
+ {
1210
+ "epoch": 2.5990922844175492,
1211
+ "grad_norm": 1.1885757446289062,
1212
+ "learning_rate": 1.5299750337678096e-05,
1213
+ "loss": 0.2285,
1214
+ "step": 860
1215
+ },
1216
+ {
1217
+ "epoch": 2.614220877458396,
1218
+ "grad_norm": 1.0606110095977783,
1219
+ "learning_rate": 1.514988265143731e-05,
1220
+ "loss": 0.2075,
1221
+ "step": 865
1222
+ },
1223
+ {
1224
+ "epoch": 2.6293494704992435,
1225
+ "grad_norm": 1.3303227424621582,
1226
+ "learning_rate": 1.5e-05,
1227
+ "loss": 0.2273,
1228
+ "step": 870
1229
+ },
1230
+ {
1231
+ "epoch": 2.644478063540091,
1232
+ "grad_norm": 0.9864528179168701,
1233
+ "learning_rate": 1.4850117348562696e-05,
1234
+ "loss": 0.2005,
1235
+ "step": 875
1236
+ },
1237
+ {
1238
+ "epoch": 2.659606656580938,
1239
+ "grad_norm": 1.1116912364959717,
1240
+ "learning_rate": 1.4700249662321903e-05,
1241
+ "loss": 0.2165,
1242
+ "step": 880
1243
+ },
1244
+ {
1245
+ "epoch": 2.674735249621785,
1246
+ "grad_norm": 1.2655029296875,
1247
+ "learning_rate": 1.4550411904979939e-05,
1248
+ "loss": 0.2307,
1249
+ "step": 885
1250
+ },
1251
+ {
1252
+ "epoch": 2.6898638426626325,
1253
+ "grad_norm": 1.167178750038147,
1254
+ "learning_rate": 1.440061903725082e-05,
1255
+ "loss": 0.2081,
1256
+ "step": 890
1257
+ },
1258
+ {
1259
+ "epoch": 2.7049924357034794,
1260
+ "grad_norm": 1.1951887607574463,
1261
+ "learning_rate": 1.4250886015366502e-05,
1262
+ "loss": 0.186,
1263
+ "step": 895
1264
+ },
1265
+ {
1266
+ "epoch": 2.7201210287443267,
1267
+ "grad_norm": 1.2827143669128418,
1268
+ "learning_rate": 1.4101227789583594e-05,
1269
+ "loss": 0.1999,
1270
+ "step": 900
1271
+ },
1272
+ {
1273
+ "epoch": 2.735249621785174,
1274
+ "grad_norm": 1.1490733623504639,
1275
+ "learning_rate": 1.3951659302690578e-05,
1276
+ "loss": 0.2209,
1277
+ "step": 905
1278
+ },
1279
+ {
1280
+ "epoch": 2.7503782148260214,
1281
+ "grad_norm": 1.126470685005188,
1282
+ "learning_rate": 1.3802195488515886e-05,
1283
+ "loss": 0.2234,
1284
+ "step": 910
1285
+ },
1286
+ {
1287
+ "epoch": 2.7655068078668683,
1288
+ "grad_norm": 1.1791654825210571,
1289
+ "learning_rate": 1.3652851270436768e-05,
1290
+ "loss": 0.2272,
1291
+ "step": 915
1292
+ },
1293
+ {
1294
+ "epoch": 2.7806354009077157,
1295
+ "grad_norm": 1.1004488468170166,
1296
+ "learning_rate": 1.3503641559889274e-05,
1297
+ "loss": 0.1967,
1298
+ "step": 920
1299
+ },
1300
+ {
1301
+ "epoch": 2.7957639939485626,
1302
+ "grad_norm": 1.0978676080703735,
1303
+ "learning_rate": 1.335458125487942e-05,
1304
+ "loss": 0.2102,
1305
+ "step": 925
1306
+ },
1307
+ {
1308
+ "epoch": 2.81089258698941,
1309
+ "grad_norm": 1.3805896043777466,
1310
+ "learning_rate": 1.3205685238495642e-05,
1311
+ "loss": 0.1822,
1312
+ "step": 930
1313
+ },
1314
+ {
1315
+ "epoch": 2.8260211800302573,
1316
+ "grad_norm": 1.0566489696502686,
1317
+ "learning_rate": 1.3056968377422804e-05,
1318
+ "loss": 0.1914,
1319
+ "step": 935
1320
+ },
1321
+ {
1322
+ "epoch": 2.8411497730711046,
1323
+ "grad_norm": 1.0609066486358643,
1324
+ "learning_rate": 1.2908445520457832e-05,
1325
+ "loss": 0.2089,
1326
+ "step": 940
1327
+ },
1328
+ {
1329
+ "epoch": 2.8562783661119515,
1330
+ "grad_norm": 1.1454850435256958,
1331
+ "learning_rate": 1.2760131497027073e-05,
1332
+ "loss": 0.1931,
1333
+ "step": 945
1334
+ },
1335
+ {
1336
+ "epoch": 2.871406959152799,
1337
+ "grad_norm": 1.126051902770996,
1338
+ "learning_rate": 1.2612041115705679e-05,
1339
+ "loss": 0.2069,
1340
+ "step": 950
1341
+ },
1342
+ {
1343
+ "epoch": 2.8865355521936458,
1344
+ "grad_norm": 1.1702042818069458,
1345
+ "learning_rate": 1.2464189162739012e-05,
1346
+ "loss": 0.1888,
1347
+ "step": 955
1348
+ },
1349
+ {
1350
+ "epoch": 2.901664145234493,
1351
+ "grad_norm": 1.1732772588729858,
1352
+ "learning_rate": 1.2316590400566286e-05,
1353
+ "loss": 0.1928,
1354
+ "step": 960
1355
+ },
1356
+ {
1357
+ "epoch": 2.9167927382753405,
1358
+ "grad_norm": 1.3314909934997559,
1359
+ "learning_rate": 1.2169259566346612e-05,
1360
+ "loss": 0.2029,
1361
+ "step": 965
1362
+ },
1363
+ {
1364
+ "epoch": 2.931921331316188,
1365
+ "grad_norm": 1.1199489831924438,
1366
+ "learning_rate": 1.2022211370487546e-05,
1367
+ "loss": 0.1871,
1368
+ "step": 970
1369
+ },
1370
+ {
1371
+ "epoch": 2.9470499243570347,
1372
+ "grad_norm": 1.098125696182251,
1373
+ "learning_rate": 1.1875460495176297e-05,
1374
+ "loss": 0.1638,
1375
+ "step": 975
1376
+ },
1377
+ {
1378
+ "epoch": 2.962178517397882,
1379
+ "grad_norm": 1.1100715398788452,
1380
+ "learning_rate": 1.1729021592913791e-05,
1381
+ "loss": 0.2053,
1382
+ "step": 980
1383
+ },
1384
+ {
1385
+ "epoch": 2.977307110438729,
1386
+ "grad_norm": 1.1177160739898682,
1387
+ "learning_rate": 1.1582909285051664e-05,
1388
+ "loss": 0.1992,
1389
+ "step": 985
1390
+ },
1391
+ {
1392
+ "epoch": 2.9924357034795763,
1393
+ "grad_norm": 1.1357407569885254,
1394
+ "learning_rate": 1.1437138160332371e-05,
1395
+ "loss": 0.1834,
1396
+ "step": 990
1397
+ },
1398
+ {
1399
+ "epoch": 3.006051437216339,
1400
+ "grad_norm": 1.0767449140548706,
1401
+ "learning_rate": 1.1291722773432571e-05,
1402
+ "loss": 0.1724,
1403
+ "step": 995
1404
+ },
1405
+ {
1406
+ "epoch": 3.0211800302571863,
1407
+ "grad_norm": 1.0442917346954346,
1408
+ "learning_rate": 1.1146677643509893e-05,
1409
+ "loss": 0.1403,
1410
+ "step": 1000
1411
+ },
1412
+ {
1413
+ "epoch": 3.036308623298033,
1414
+ "grad_norm": 1.0078842639923096,
1415
+ "learning_rate": 1.100201725275326e-05,
1416
+ "loss": 0.1385,
1417
+ "step": 1005
1418
+ },
1419
+ {
1420
+ "epoch": 3.0514372163388805,
1421
+ "grad_norm": 0.955449104309082,
1422
+ "learning_rate": 1.0857756044936876e-05,
1423
+ "loss": 0.1407,
1424
+ "step": 1010
1425
+ },
1426
+ {
1427
+ "epoch": 3.066565809379728,
1428
+ "grad_norm": 1.0730222463607788,
1429
+ "learning_rate": 1.0713908423978111e-05,
1430
+ "loss": 0.1367,
1431
+ "step": 1015
1432
+ },
1433
+ {
1434
+ "epoch": 3.081694402420575,
1435
+ "grad_norm": 1.2181591987609863,
1436
+ "learning_rate": 1.0570488752499293e-05,
1437
+ "loss": 0.1622,
1438
+ "step": 1020
1439
+ },
1440
+ {
1441
+ "epoch": 3.096822995461422,
1442
+ "grad_norm": 1.12200129032135,
1443
+ "learning_rate": 1.042751135039367e-05,
1444
+ "loss": 0.1337,
1445
+ "step": 1025
1446
+ },
1447
+ {
1448
+ "epoch": 3.1119515885022695,
1449
+ "grad_norm": 1.044112205505371,
1450
+ "learning_rate": 1.028499049339562e-05,
1451
+ "loss": 0.1385,
1452
+ "step": 1030
1453
+ },
1454
+ {
1455
+ "epoch": 3.1270801815431164,
1456
+ "grad_norm": 1.2044527530670166,
1457
+ "learning_rate": 1.014294041165529e-05,
1458
+ "loss": 0.1536,
1459
+ "step": 1035
1460
+ },
1461
+ {
1462
+ "epoch": 3.1422087745839637,
1463
+ "grad_norm": 0.9035820364952087,
1464
+ "learning_rate": 1.000137528831774e-05,
1465
+ "loss": 0.1229,
1466
+ "step": 1040
1467
+ },
1468
+ {
1469
+ "epoch": 3.157337367624811,
1470
+ "grad_norm": 1.0843486785888672,
1471
+ "learning_rate": 9.860309258106854e-06,
1472
+ "loss": 0.1419,
1473
+ "step": 1045
1474
+ },
1475
+ {
1476
+ "epoch": 3.172465960665658,
1477
+ "grad_norm": 1.26431405544281,
1478
+ "learning_rate": 9.719756405913997e-06,
1479
+ "loss": 0.1414,
1480
+ "step": 1050
1481
+ },
1482
+ {
1483
+ "epoch": 3.1875945537065054,
1484
+ "grad_norm": 1.0138827562332153,
1485
+ "learning_rate": 9.57973076539172e-06,
1486
+ "loss": 0.138,
1487
+ "step": 1055
1488
+ },
1489
+ {
1490
+ "epoch": 3.2027231467473527,
1491
+ "grad_norm": 1.0034918785095215,
1492
+ "learning_rate": 9.440246317552568e-06,
1493
+ "loss": 0.1295,
1494
+ "step": 1060
1495
+ },
1496
+ {
1497
+ "epoch": 3.2178517397881996,
1498
+ "grad_norm": 1.10910165309906,
1499
+ "learning_rate": 9.301316989373092e-06,
1500
+ "loss": 0.1449,
1501
+ "step": 1065
1502
+ },
1503
+ {
1504
+ "epoch": 3.232980332829047,
1505
+ "grad_norm": 0.9604946970939636,
1506
+ "learning_rate": 9.162956652403324e-06,
1507
+ "loss": 0.1227,
1508
+ "step": 1070
1509
+ },
1510
+ {
1511
+ "epoch": 3.2481089258698943,
1512
+ "grad_norm": 1.1060423851013184,
1513
+ "learning_rate": 9.025179121381763e-06,
1514
+ "loss": 0.1541,
1515
+ "step": 1075
1516
+ },
1517
+ {
1518
+ "epoch": 3.263237518910741,
1519
+ "grad_norm": 1.052452802658081,
1520
+ "learning_rate": 8.88799815285599e-06,
1521
+ "loss": 0.1356,
1522
+ "step": 1080
1523
+ },
1524
+ {
1525
+ "epoch": 3.2783661119515886,
1526
+ "grad_norm": 1.1046732664108276,
1527
+ "learning_rate": 8.751427443809163e-06,
1528
+ "loss": 0.1313,
1529
+ "step": 1085
1530
+ },
1531
+ {
1532
+ "epoch": 3.293494704992436,
1533
+ "grad_norm": 1.0501445531845093,
1534
+ "learning_rate": 8.615480630292426e-06,
1535
+ "loss": 0.1194,
1536
+ "step": 1090
1537
+ },
1538
+ {
1539
+ "epoch": 3.308623298033283,
1540
+ "grad_norm": 1.0697016716003418,
1541
+ "learning_rate": 8.480171286063389e-06,
1542
+ "loss": 0.1469,
1543
+ "step": 1095
1544
+ },
1545
+ {
1546
+ "epoch": 3.32375189107413,
1547
+ "grad_norm": 1.036516785621643,
1548
+ "learning_rate": 8.34551292123085e-06,
1549
+ "loss": 0.1287,
1550
+ "step": 1100
1551
+ },
1552
+ {
1553
+ "epoch": 3.338880484114977,
1554
+ "grad_norm": 0.9603574872016907,
1555
+ "learning_rate": 8.211518980905842e-06,
1556
+ "loss": 0.1261,
1557
+ "step": 1105
1558
+ },
1559
+ {
1560
+ "epoch": 3.3540090771558244,
1561
+ "grad_norm": 1.0846987962722778,
1562
+ "learning_rate": 8.078202843859234e-06,
1563
+ "loss": 0.1188,
1564
+ "step": 1110
1565
+ },
1566
+ {
1567
+ "epoch": 3.3691376701966718,
1568
+ "grad_norm": 1.0942096710205078,
1569
+ "learning_rate": 7.94557782118588e-06,
1570
+ "loss": 0.1252,
1571
+ "step": 1115
1572
+ },
1573
+ {
1574
+ "epoch": 3.384266263237519,
1575
+ "grad_norm": 1.0738424062728882,
1576
+ "learning_rate": 7.813657154975566e-06,
1577
+ "loss": 0.1358,
1578
+ "step": 1120
1579
+ },
1580
+ {
1581
+ "epoch": 3.399394856278366,
1582
+ "grad_norm": 1.0699353218078613,
1583
+ "learning_rate": 7.682454016990836e-06,
1584
+ "loss": 0.1179,
1585
+ "step": 1125
1586
+ },
1587
+ {
1588
+ "epoch": 3.4145234493192134,
1589
+ "grad_norm": 1.213776707649231,
1590
+ "learning_rate": 7.5519815073518615e-06,
1591
+ "loss": 0.1243,
1592
+ "step": 1130
1593
+ },
1594
+ {
1595
+ "epoch": 3.4296520423600603,
1596
+ "grad_norm": 1.0862241983413696,
1597
+ "learning_rate": 7.422252653228401e-06,
1598
+ "loss": 0.1322,
1599
+ "step": 1135
1600
+ },
1601
+ {
1602
+ "epoch": 3.4447806354009076,
1603
+ "grad_norm": 0.9266173243522644,
1604
+ "learning_rate": 7.293280407539161e-06,
1605
+ "loss": 0.127,
1606
+ "step": 1140
1607
+ },
1608
+ {
1609
+ "epoch": 3.459909228441755,
1610
+ "grad_norm": 0.983642041683197,
1611
+ "learning_rate": 7.16507764765842e-06,
1612
+ "loss": 0.1192,
1613
+ "step": 1145
1614
+ },
1615
+ {
1616
+ "epoch": 3.4750378214826023,
1617
+ "grad_norm": 0.9537131786346436,
1618
+ "learning_rate": 7.037657174130322e-06,
1619
+ "loss": 0.1317,
1620
+ "step": 1150
1621
+ },
1622
+ {
1623
+ "epoch": 3.4901664145234492,
1624
+ "grad_norm": 1.0515652894973755,
1625
+ "learning_rate": 6.911031709390778e-06,
1626
+ "loss": 0.1164,
1627
+ "step": 1155
1628
+ },
1629
+ {
1630
+ "epoch": 3.5052950075642966,
1631
+ "grad_norm": 1.1290754079818726,
1632
+ "learning_rate": 6.785213896497187e-06,
1633
+ "loss": 0.1009,
1634
+ "step": 1160
1635
+ },
1636
+ {
1637
+ "epoch": 3.5204236006051435,
1638
+ "grad_norm": 1.0471539497375488,
1639
+ "learning_rate": 6.660216297866044e-06,
1640
+ "loss": 0.1202,
1641
+ "step": 1165
1642
+ },
1643
+ {
1644
+ "epoch": 3.535552193645991,
1645
+ "grad_norm": 1.033564567565918,
1646
+ "learning_rate": 6.536051394018702e-06,
1647
+ "loss": 0.128,
1648
+ "step": 1170
1649
+ },
1650
+ {
1651
+ "epoch": 3.550680786686838,
1652
+ "grad_norm": 1.1593022346496582,
1653
+ "learning_rate": 6.412731582335146e-06,
1654
+ "loss": 0.1196,
1655
+ "step": 1175
1656
+ },
1657
+ {
1658
+ "epoch": 3.5658093797276855,
1659
+ "grad_norm": 1.0917978286743164,
1660
+ "learning_rate": 6.290269175816268e-06,
1661
+ "loss": 0.113,
1662
+ "step": 1180
1663
+ },
1664
+ {
1665
+ "epoch": 3.5809379727685324,
1666
+ "grad_norm": 1.2841700315475464,
1667
+ "learning_rate": 6.168676401854357e-06,
1668
+ "loss": 0.1239,
1669
+ "step": 1185
1670
+ },
1671
+ {
1672
+ "epoch": 3.59606656580938,
1673
+ "grad_norm": 1.0042773485183716,
1674
+ "learning_rate": 6.047965401012324e-06,
1675
+ "loss": 0.118,
1676
+ "step": 1190
1677
+ },
1678
+ {
1679
+ "epoch": 3.6111951588502267,
1680
+ "grad_norm": 0.8104374408721924,
1681
+ "learning_rate": 5.92814822581147e-06,
1682
+ "loss": 0.1183,
1683
+ "step": 1195
1684
+ },
1685
+ {
1686
+ "epoch": 3.626323751891074,
1687
+ "grad_norm": 1.1587086915969849,
1688
+ "learning_rate": 5.809236839528115e-06,
1689
+ "loss": 0.1165,
1690
+ "step": 1200
1691
+ },
1692
+ {
1693
+ "epoch": 3.6414523449319214,
1694
+ "grad_norm": 1.0012660026550293,
1695
+ "learning_rate": 5.6912431149990704e-06,
1696
+ "loss": 0.111,
1697
+ "step": 1205
1698
+ },
1699
+ {
1700
+ "epoch": 3.6565809379727687,
1701
+ "grad_norm": 0.9174544215202332,
1702
+ "learning_rate": 5.5741788334362505e-06,
1703
+ "loss": 0.111,
1704
+ "step": 1210
1705
+ },
1706
+ {
1707
+ "epoch": 3.6717095310136156,
1708
+ "grad_norm": 0.8950276374816895,
1709
+ "learning_rate": 5.458055683250288e-06,
1710
+ "loss": 0.1189,
1711
+ "step": 1215
1712
+ },
1713
+ {
1714
+ "epoch": 3.686838124054463,
1715
+ "grad_norm": 0.8682631850242615,
1716
+ "learning_rate": 5.342885258883548e-06,
1717
+ "loss": 0.1212,
1718
+ "step": 1220
1719
+ },
1720
+ {
1721
+ "epoch": 3.70196671709531,
1722
+ "grad_norm": 1.0850940942764282,
1723
+ "learning_rate": 5.228679059652446e-06,
1724
+ "loss": 0.1148,
1725
+ "step": 1225
1726
+ },
1727
+ {
1728
+ "epoch": 3.7170953101361572,
1729
+ "grad_norm": 1.039988398551941,
1730
+ "learning_rate": 5.115448488599287e-06,
1731
+ "loss": 0.1206,
1732
+ "step": 1230
1733
+ },
1734
+ {
1735
+ "epoch": 3.7322239031770046,
1736
+ "grad_norm": 0.8652891516685486,
1737
+ "learning_rate": 5.003204851353719e-06,
1738
+ "loss": 0.1115,
1739
+ "step": 1235
1740
+ },
1741
+ {
1742
+ "epoch": 3.747352496217852,
1743
+ "grad_norm": 0.9648194313049316,
1744
+ "learning_rate": 4.891959355003916e-06,
1745
+ "loss": 0.1104,
1746
+ "step": 1240
1747
+ },
1748
+ {
1749
+ "epoch": 3.762481089258699,
1750
+ "grad_norm": 0.8316813111305237,
1751
+ "learning_rate": 4.781723106977564e-06,
1752
+ "loss": 0.1154,
1753
+ "step": 1245
1754
+ },
1755
+ {
1756
+ "epoch": 3.777609682299546,
1757
+ "grad_norm": 0.9703608155250549,
1758
+ "learning_rate": 4.672507113932888e-06,
1759
+ "loss": 0.1208,
1760
+ "step": 1250
1761
+ },
1762
+ {
1763
+ "epoch": 3.792738275340393,
1764
+ "grad_norm": 0.9646096229553223,
1765
+ "learning_rate": 4.564322280659612e-06,
1766
+ "loss": 0.1031,
1767
+ "step": 1255
1768
+ },
1769
+ {
1770
+ "epoch": 3.8078668683812404,
1771
+ "grad_norm": 0.8670827746391296,
1772
+ "learning_rate": 4.457179408990203e-06,
1773
+ "loss": 0.104,
1774
+ "step": 1260
1775
+ },
1776
+ {
1777
+ "epoch": 3.822995461422088,
1778
+ "grad_norm": 0.9630376696586609,
1779
+ "learning_rate": 4.35108919672134e-06,
1780
+ "loss": 0.1069,
1781
+ "step": 1265
1782
+ },
1783
+ {
1784
+ "epoch": 3.838124054462935,
1785
+ "grad_norm": 0.9716989994049072,
1786
+ "learning_rate": 4.246062236545771e-06,
1787
+ "loss": 0.1154,
1788
+ "step": 1270
1789
+ },
1790
+ {
1791
+ "epoch": 3.853252647503782,
1792
+ "grad_norm": 0.8012140989303589,
1793
+ "learning_rate": 4.142109014994685e-06,
1794
+ "loss": 0.1086,
1795
+ "step": 1275
1796
+ },
1797
+ {
1798
+ "epoch": 3.8683812405446294,
1799
+ "grad_norm": 0.9220558404922485,
1800
+ "learning_rate": 4.0392399113906735e-06,
1801
+ "loss": 0.1132,
1802
+ "step": 1280
1803
+ },
1804
+ {
1805
+ "epoch": 3.8835098335854763,
1806
+ "grad_norm": 0.9726728796958923,
1807
+ "learning_rate": 3.937465196811375e-06,
1808
+ "loss": 0.1108,
1809
+ "step": 1285
1810
+ },
1811
+ {
1812
+ "epoch": 3.8986384266263236,
1813
+ "grad_norm": 0.9615852236747742,
1814
+ "learning_rate": 3.836795033063982e-06,
1815
+ "loss": 0.1134,
1816
+ "step": 1290
1817
+ },
1818
+ {
1819
+ "epoch": 3.913767019667171,
1820
+ "grad_norm": 0.8726525902748108,
1821
+ "learning_rate": 3.7372394716706e-06,
1822
+ "loss": 0.1106,
1823
+ "step": 1295
1824
+ },
1825
+ {
1826
+ "epoch": 3.9288956127080183,
1827
+ "grad_norm": 1.0038763284683228,
1828
+ "learning_rate": 3.638808452864646e-06,
1829
+ "loss": 0.1073,
1830
+ "step": 1300
1831
+ },
1832
+ {
1833
+ "epoch": 3.9440242057488653,
1834
+ "grad_norm": 0.9384374618530273,
1835
+ "learning_rate": 3.5415118045983635e-06,
1836
+ "loss": 0.1091,
1837
+ "step": 1305
1838
+ },
1839
+ {
1840
+ "epoch": 3.9591527987897126,
1841
+ "grad_norm": 0.9748557209968567,
1842
+ "learning_rate": 3.4453592415615336e-06,
1843
+ "loss": 0.1147,
1844
+ "step": 1310
1845
+ },
1846
+ {
1847
+ "epoch": 3.9742813918305595,
1848
+ "grad_norm": 0.939871072769165,
1849
+ "learning_rate": 3.350360364211494e-06,
1850
+ "loss": 0.1132,
1851
+ "step": 1315
1852
+ },
1853
+ {
1854
+ "epoch": 3.989409984871407,
1855
+ "grad_norm": 0.8852137327194214,
1856
+ "learning_rate": 3.256524657814588e-06,
1857
+ "loss": 0.1022,
1858
+ "step": 1320
1859
+ },
1860
+ {
1861
+ "epoch": 4.0030257186081695,
1862
+ "grad_norm": 0.7500025033950806,
1863
+ "learning_rate": 3.163861491499086e-06,
1864
+ "loss": 0.1008,
1865
+ "step": 1325
1866
+ },
1867
+ {
1868
+ "epoch": 4.018154311649017,
1869
+ "grad_norm": 0.886324942111969,
1870
+ "learning_rate": 3.0723801173197153e-06,
1871
+ "loss": 0.0754,
1872
+ "step": 1330
1873
+ },
1874
+ {
1875
+ "epoch": 4.033282904689864,
1876
+ "grad_norm": 0.8764703273773193,
1877
+ "learning_rate": 2.9820896693338846e-06,
1878
+ "loss": 0.0874,
1879
+ "step": 1335
1880
+ },
1881
+ {
1882
+ "epoch": 4.048411497730711,
1883
+ "grad_norm": 0.8988645672798157,
1884
+ "learning_rate": 2.8929991626896786e-06,
1885
+ "loss": 0.0794,
1886
+ "step": 1340
1887
+ },
1888
+ {
1889
+ "epoch": 4.063540090771558,
1890
+ "grad_norm": 0.8474794030189514,
1891
+ "learning_rate": 2.805117492725731e-06,
1892
+ "loss": 0.0889,
1893
+ "step": 1345
1894
+ },
1895
+ {
1896
+ "epoch": 4.078668683812405,
1897
+ "grad_norm": 0.8318480253219604,
1898
+ "learning_rate": 2.71845343408306e-06,
1899
+ "loss": 0.076,
1900
+ "step": 1350
1901
+ },
1902
+ {
1903
+ "epoch": 4.093797276853253,
1904
+ "grad_norm": 0.7629009485244751,
1905
+ "learning_rate": 2.633015639828957e-06,
1906
+ "loss": 0.081,
1907
+ "step": 1355
1908
+ },
1909
+ {
1910
+ "epoch": 4.1089258698941,
1911
+ "grad_norm": 0.8618624806404114,
1912
+ "learning_rate": 2.5488126405930117e-06,
1913
+ "loss": 0.0795,
1914
+ "step": 1360
1915
+ },
1916
+ {
1917
+ "epoch": 4.124054462934947,
1918
+ "grad_norm": 0.8529701232910156,
1919
+ "learning_rate": 2.4658528437153605e-06,
1920
+ "loss": 0.0927,
1921
+ "step": 1365
1922
+ },
1923
+ {
1924
+ "epoch": 4.139183055975794,
1925
+ "grad_norm": 0.8088263869285583,
1926
+ "learning_rate": 2.3841445324072466e-06,
1927
+ "loss": 0.0853,
1928
+ "step": 1370
1929
+ },
1930
+ {
1931
+ "epoch": 4.154311649016641,
1932
+ "grad_norm": 0.9408968687057495,
1933
+ "learning_rate": 2.303695864923976e-06,
1934
+ "loss": 0.0848,
1935
+ "step": 1375
1936
+ },
1937
+ {
1938
+ "epoch": 4.1694402420574885,
1939
+ "grad_norm": 0.7560146450996399,
1940
+ "learning_rate": 2.2245148737503345e-06,
1941
+ "loss": 0.0968,
1942
+ "step": 1380
1943
+ },
1944
+ {
1945
+ "epoch": 4.184568835098336,
1946
+ "grad_norm": 0.761082649230957,
1947
+ "learning_rate": 2.1466094647986055e-06,
1948
+ "loss": 0.0856,
1949
+ "step": 1385
1950
+ },
1951
+ {
1952
+ "epoch": 4.199697428139183,
1953
+ "grad_norm": 0.7905988097190857,
1954
+ "learning_rate": 2.0699874166191597e-06,
1955
+ "loss": 0.0817,
1956
+ "step": 1390
1957
+ },
1958
+ {
1959
+ "epoch": 4.214826021180031,
1960
+ "grad_norm": 0.8756046891212463,
1961
+ "learning_rate": 1.9946563796238237e-06,
1962
+ "loss": 0.0877,
1963
+ "step": 1395
1964
+ },
1965
+ {
1966
+ "epoch": 4.229954614220877,
1967
+ "grad_norm": 0.7569416761398315,
1968
+ "learning_rate": 1.920623875322002e-06,
1969
+ "loss": 0.0801,
1970
+ "step": 1400
1971
+ },
1972
+ {
1973
+ "epoch": 4.245083207261724,
1974
+ "grad_norm": 0.7701637148857117,
1975
+ "learning_rate": 1.8478972955696944e-06,
1976
+ "loss": 0.0841,
1977
+ "step": 1405
1978
+ },
1979
+ {
1980
+ "epoch": 4.260211800302572,
1981
+ "grad_norm": 0.8360387682914734,
1982
+ "learning_rate": 1.7764839018314293e-06,
1983
+ "loss": 0.0824,
1984
+ "step": 1410
1985
+ },
1986
+ {
1987
+ "epoch": 4.275340393343419,
1988
+ "grad_norm": 0.7571801543235779,
1989
+ "learning_rate": 1.706390824455269e-06,
1990
+ "loss": 0.0816,
1991
+ "step": 1415
1992
+ },
1993
+ {
1994
+ "epoch": 4.290468986384266,
1995
+ "grad_norm": 0.6919256448745728,
1996
+ "learning_rate": 1.637625061960827e-06,
1997
+ "loss": 0.0802,
1998
+ "step": 1420
1999
+ },
2000
+ {
2001
+ "epoch": 4.305597579425114,
2002
+ "grad_norm": 0.8499504923820496,
2003
+ "learning_rate": 1.5701934803405393e-06,
2004
+ "loss": 0.08,
2005
+ "step": 1425
2006
+ },
2007
+ {
2008
+ "epoch": 4.32072617246596,
2009
+ "grad_norm": 0.8733304738998413,
2010
+ "learning_rate": 1.5041028123740853e-06,
2011
+ "loss": 0.0812,
2012
+ "step": 1430
2013
+ },
2014
+ {
2015
+ "epoch": 4.335854765506808,
2016
+ "grad_norm": 0.8591729998588562,
2017
+ "learning_rate": 1.4393596569561635e-06,
2018
+ "loss": 0.0799,
2019
+ "step": 1435
2020
+ },
2021
+ {
2022
+ "epoch": 4.350983358547655,
2023
+ "grad_norm": 0.8090836405754089,
2024
+ "learning_rate": 1.3759704784376186e-06,
2025
+ "loss": 0.0792,
2026
+ "step": 1440
2027
+ },
2028
+ {
2029
+ "epoch": 4.366111951588502,
2030
+ "grad_norm": 0.7546667456626892,
2031
+ "learning_rate": 1.3139416059799975e-06,
2032
+ "loss": 0.0852,
2033
+ "step": 1445
2034
+ },
2035
+ {
2036
+ "epoch": 4.38124054462935,
2037
+ "grad_norm": 0.7740128636360168,
2038
+ "learning_rate": 1.2532792329235988e-06,
2039
+ "loss": 0.0802,
2040
+ "step": 1450
2041
+ },
2042
+ {
2043
+ "epoch": 4.396369137670197,
2044
+ "grad_norm": 0.8964858651161194,
2045
+ "learning_rate": 1.1939894161691185e-06,
2046
+ "loss": 0.083,
2047
+ "step": 1455
2048
+ },
2049
+ {
2050
+ "epoch": 4.411497730711044,
2051
+ "grad_norm": 0.9138798117637634,
2052
+ "learning_rate": 1.1360780755728484e-06,
2053
+ "loss": 0.0904,
2054
+ "step": 1460
2055
+ },
2056
+ {
2057
+ "epoch": 4.426626323751891,
2058
+ "grad_norm": 0.7522209286689758,
2059
+ "learning_rate": 1.0795509933556575e-06,
2060
+ "loss": 0.0844,
2061
+ "step": 1465
2062
+ },
2063
+ {
2064
+ "epoch": 4.441754916792738,
2065
+ "grad_norm": 0.8203308582305908,
2066
+ "learning_rate": 1.0244138135256031e-06,
2067
+ "loss": 0.0773,
2068
+ "step": 1470
2069
+ },
2070
+ {
2071
+ "epoch": 4.4568835098335855,
2072
+ "grad_norm": 0.8203104734420776,
2073
+ "learning_rate": 9.70672041314441e-07,
2074
+ "loss": 0.0803,
2075
+ "step": 1475
2076
+ },
2077
+ {
2078
+ "epoch": 4.472012102874433,
2079
+ "grad_norm": 0.7725104093551636,
2080
+ "learning_rate": 9.18331042627929e-07,
2081
+ "loss": 0.0766,
2082
+ "step": 1480
2083
+ },
2084
+ {
2085
+ "epoch": 4.48714069591528,
2086
+ "grad_norm": 0.8105193376541138,
2087
+ "learning_rate": 8.673960435100698e-07,
2088
+ "loss": 0.0795,
2089
+ "step": 1485
2090
+ },
2091
+ {
2092
+ "epoch": 4.502269288956127,
2093
+ "grad_norm": 0.7565754652023315,
2094
+ "learning_rate": 8.178721296213009e-07,
2095
+ "loss": 0.0778,
2096
+ "step": 1490
2097
+ },
2098
+ {
2099
+ "epoch": 4.517397881996974,
2100
+ "grad_norm": 0.8591950535774231,
2101
+ "learning_rate": 7.697642457307319e-07,
2102
+ "loss": 0.0828,
2103
+ "step": 1495
2104
+ },
2105
+ {
2106
+ "epoch": 4.532526475037821,
2107
+ "grad_norm": 0.7466459274291992,
2108
+ "learning_rate": 7.23077195222403e-07,
2109
+ "loss": 0.0752,
2110
+ "step": 1500
2111
+ },
2112
+ {
2113
+ "epoch": 4.547655068078669,
2114
+ "grad_norm": 0.7037179470062256,
2115
+ "learning_rate": 6.778156396157048e-07,
2116
+ "loss": 0.0802,
2117
+ "step": 1505
2118
+ },
2119
+ {
2120
+ "epoch": 4.562783661119516,
2121
+ "grad_norm": 0.8281891942024231,
2122
+ "learning_rate": 6.339840980999351e-07,
2123
+ "loss": 0.0848,
2124
+ "step": 1510
2125
+ },
2126
+ {
2127
+ "epoch": 4.577912254160363,
2128
+ "grad_norm": 0.6697319746017456,
2129
+ "learning_rate": 5.915869470830781e-07,
2130
+ "loss": 0.0878,
2131
+ "step": 1515
2132
+ },
2133
+ {
2134
+ "epoch": 4.593040847201211,
2135
+ "grad_norm": 0.8832966089248657,
2136
+ "learning_rate": 5.506284197548395e-07,
2137
+ "loss": 0.0837,
2138
+ "step": 1520
2139
+ },
2140
+ {
2141
+ "epoch": 4.608169440242057,
2142
+ "grad_norm": 0.7058913111686707,
2143
+ "learning_rate": 5.11112605663977e-07,
2144
+ "loss": 0.0882,
2145
+ "step": 1525
2146
+ },
2147
+ {
2148
+ "epoch": 4.623298033282905,
2149
+ "grad_norm": 0.8949100971221924,
2150
+ "learning_rate": 4.7304345030996623e-07,
2151
+ "loss": 0.0833,
2152
+ "step": 1530
2153
+ },
2154
+ {
2155
+ "epoch": 4.638426626323752,
2156
+ "grad_norm": 0.8610048294067383,
2157
+ "learning_rate": 4.364247547490735e-07,
2158
+ "loss": 0.0894,
2159
+ "step": 1535
2160
+ },
2161
+ {
2162
+ "epoch": 4.653555219364599,
2163
+ "grad_norm": 0.6452397704124451,
2164
+ "learning_rate": 4.012601752148265e-07,
2165
+ "loss": 0.0788,
2166
+ "step": 1540
2167
+ },
2168
+ {
2169
+ "epoch": 4.668683812405447,
2170
+ "grad_norm": 0.7913710474967957,
2171
+ "learning_rate": 3.67553222752956e-07,
2172
+ "loss": 0.08,
2173
+ "step": 1545
2174
+ },
2175
+ {
2176
+ "epoch": 4.683812405446293,
2177
+ "grad_norm": 0.6712328791618347,
2178
+ "learning_rate": 3.353072628708298e-07,
2179
+ "loss": 0.0759,
2180
+ "step": 1550
2181
+ },
2182
+ {
2183
+ "epoch": 4.69894099848714,
2184
+ "grad_norm": 0.8716601133346558,
2185
+ "learning_rate": 3.0452551520141647e-07,
2186
+ "loss": 0.0839,
2187
+ "step": 1555
2188
+ },
2189
+ {
2190
+ "epoch": 4.714069591527988,
2191
+ "grad_norm": 0.7100825905799866,
2192
+ "learning_rate": 2.752110531818325e-07,
2193
+ "loss": 0.0785,
2194
+ "step": 1560
2195
+ },
2196
+ {
2197
+ "epoch": 4.729198184568835,
2198
+ "grad_norm": 0.7657788395881653,
2199
+ "learning_rate": 2.473668037464494e-07,
2200
+ "loss": 0.0759,
2201
+ "step": 1565
2202
+ },
2203
+ {
2204
+ "epoch": 4.7443267776096825,
2205
+ "grad_norm": 0.7530739307403564,
2206
+ "learning_rate": 2.2099554703466916e-07,
2207
+ "loss": 0.0841,
2208
+ "step": 1570
2209
+ },
2210
+ {
2211
+ "epoch": 4.75945537065053,
2212
+ "grad_norm": 0.812310516834259,
2213
+ "learning_rate": 1.9609991611333145e-07,
2214
+ "loss": 0.0835,
2215
+ "step": 1575
2216
+ },
2217
+ {
2218
+ "epoch": 4.774583963691377,
2219
+ "grad_norm": 0.8391323685646057,
2220
+ "learning_rate": 1.7268239671381025e-07,
2221
+ "loss": 0.0802,
2222
+ "step": 1580
2223
+ },
2224
+ {
2225
+ "epoch": 4.789712556732224,
2226
+ "grad_norm": 0.7643956542015076,
2227
+ "learning_rate": 1.5074532698382438e-07,
2228
+ "loss": 0.0749,
2229
+ "step": 1585
2230
+ },
2231
+ {
2232
+ "epoch": 4.804841149773071,
2233
+ "grad_norm": 0.8218677043914795,
2234
+ "learning_rate": 1.30290897253989e-07,
2235
+ "loss": 0.0831,
2236
+ "step": 1590
2237
+ },
2238
+ {
2239
+ "epoch": 4.819969742813918,
2240
+ "grad_norm": 0.8395576477050781,
2241
+ "learning_rate": 1.1132114981910912e-07,
2242
+ "loss": 0.0842,
2243
+ "step": 1595
2244
+ },
2245
+ {
2246
+ "epoch": 4.835098335854766,
2247
+ "grad_norm": 0.7999187111854553,
2248
+ "learning_rate": 9.383797873426914e-08,
2249
+ "loss": 0.0828,
2250
+ "step": 1600
2251
+ },
2252
+ {
2253
+ "epoch": 4.850226928895613,
2254
+ "grad_norm": 0.7831802368164062,
2255
+ "learning_rate": 7.78431296257226e-08,
2256
+ "loss": 0.079,
2257
+ "step": 1605
2258
+ },
2259
+ {
2260
+ "epoch": 4.8653555219364595,
2261
+ "grad_norm": 0.7924562692642212,
2262
+ "learning_rate": 6.333819951659159e-08,
2263
+ "loss": 0.0839,
2264
+ "step": 1610
2265
+ },
2266
+ {
2267
+ "epoch": 4.880484114977307,
2268
+ "grad_norm": 0.8063008189201355,
2269
+ "learning_rate": 5.0324636667419265e-08,
2270
+ "loss": 0.0881,
2271
+ "step": 1615
2272
+ },
2273
+ {
2274
+ "epoch": 4.895612708018154,
2275
+ "grad_norm": 0.6993975639343262,
2276
+ "learning_rate": 3.880374043155388e-08,
2277
+ "loss": 0.0824,
2278
+ "step": 1620
2279
+ },
2280
+ {
2281
+ "epoch": 4.9107413010590015,
2282
+ "grad_norm": 0.7667171955108643,
2283
+ "learning_rate": 2.8776661125422542e-08,
2284
+ "loss": 0.0861,
2285
+ "step": 1625
2286
+ },
2287
+ {
2288
+ "epoch": 4.925869894099849,
2289
+ "grad_norm": 0.7333769202232361,
2290
+ "learning_rate": 2.0244399913679767e-08,
2291
+ "loss": 0.0857,
2292
+ "step": 1630
2293
+ },
2294
+ {
2295
+ "epoch": 4.940998487140696,
2296
+ "grad_norm": 0.7798089385032654,
2297
+ "learning_rate": 1.320780870923244e-08,
2298
+ "loss": 0.0749,
2299
+ "step": 1635
2300
+ },
2301
+ {
2302
+ "epoch": 4.956127080181544,
2303
+ "grad_norm": 0.802169919013977,
2304
+ "learning_rate": 7.667590088191179e-09,
2305
+ "loss": 0.0752,
2306
+ "step": 1640
2307
+ },
2308
+ {
2309
+ "epoch": 4.97125567322239,
2310
+ "grad_norm": 0.9189748167991638,
2311
+ "learning_rate": 3.624297219718131e-09,
2312
+ "loss": 0.0847,
2313
+ "step": 1645
2314
+ },
2315
+ {
2316
+ "epoch": 4.986384266263237,
2317
+ "grad_norm": 0.7418760657310486,
2318
+ "learning_rate": 1.078333810789478e-09,
2319
+ "loss": 0.0774,
2320
+ "step": 1650
2321
+ },
2322
+ {
2323
+ "epoch": 5.0,
2324
+ "grad_norm": 1.0475550889968872,
2325
+ "learning_rate": 2.995406589434424e-11,
2326
+ "loss": 0.0869,
2327
+ "step": 1655
2328
+ }
2329
+ ],
2330
+ "logging_steps": 5,
2331
+ "max_steps": 1655,
2332
+ "num_input_tokens_seen": 0,
2333
+ "num_train_epochs": 5,
2334
+ "save_steps": 2000,
2335
+ "stateful_callbacks": {
2336
+ "TrainerControl": {
2337
+ "args": {
2338
+ "should_epoch_stop": false,
2339
+ "should_evaluate": false,
2340
+ "should_log": false,
2341
+ "should_save": true,
2342
+ "should_training_stop": true
2343
+ },
2344
+ "attributes": {}
2345
+ }
2346
+ },
2347
+ "total_flos": 2.483806418474369e+18,
2348
+ "train_batch_size": 2,
2349
+ "trial_name": null,
2350
+ "trial_params": null
2351
+ }
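The entries above close out this checkpoint's `trainer_state.json`: `log_history` records epoch, grad_norm, learning_rate and loss every 5 steps (per `logging_steps`), and the trailing fields hold run-level metadata (1655 max steps, 5 epochs, total FLOs). A minimal sketch for pulling the loss curve back out of the file; the path is the checkpoint folder named below, and everything else is illustrative rather than part of this upload:

```python
# Minimal sketch (path is illustrative): read a checkpoint's trainer_state.json
# and extract the loss values logged at each logging step.
import json

with open("20_128_e5_3e-5/checkpoint-1655/trainer_state.json") as f:
    state = json.load(f)

# Every log_history entry written at a logging step carries a "loss" key.
points = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
steps, losses = zip(*points)
print(f"{len(points)} logged points; final loss {losses[-1]} at step {steps[-1]}")
```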
20_128_e5_3e-5/checkpoint-1655/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2fdd8e6403af008b87810dc8e1147b217350447ed548cf0396b8b43910b391a1
3
+ size 7736
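`training_args.bin` is stored as a Git LFS pointer (7.7 kB) and is the pickled `TrainingArguments` object the Trainer ran with. A hedged sketch for inspecting it after fetching the real file from LFS; `weights_only=False` is needed because the file is a full pickle, and `transformers` must be importable so the object can be unpickled:

```python
# Minimal sketch (assumes the LFS object has been downloaded and transformers
# is installed; the attributes printed are standard TrainingArguments fields).
import torch

args = torch.load("20_128_e5_3e-5/checkpoint-1655/training_args.bin", weights_only=False)
print(args.learning_rate, args.num_train_epochs, args.per_device_train_batch_size)
```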
20_128_e5_3e-5/checkpoint-1655/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-1655/zero_to_fp32.py ADDED
@@ -0,0 +1,604 @@
1
+ #!/usr/bin/env python
2
+
3
+ # Copyright (c) Microsoft Corporation.
4
+ # SPDX-License-Identifier: Apache-2.0
5
+
6
+ # DeepSpeed Team
7
+
8
+ # This script extracts fp32 consolidated weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
9
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
10
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
11
+ # application.
12
+ #
13
+ # example: python zero_to_fp32.py . pytorch_model.bin
14
+
15
+ import argparse
16
+ import torch
17
+ import glob
18
+ import math
19
+ import os
20
+ import re
21
+ from collections import OrderedDict
22
+ from dataclasses import dataclass
23
+
24
+ # While this script doesn't use deepspeed to recover data, the checkpoints are pickled with
25
+ # DeepSpeed data structures, so deepspeed has to be available in the current python environment.
26
+ from deepspeed.utils import logger
27
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
28
+ FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
29
+ FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
30
+
31
+
32
+ @dataclass
33
+ class zero_model_state:
34
+ buffers: dict()
35
+ param_shapes: dict()
36
+ shared_params: list
37
+ ds_version: int
38
+ frozen_param_shapes: dict()
39
+ frozen_param_fragments: dict()
40
+
41
+
42
+ debug = 0
43
+
44
+ # load to cpu
45
+ device = torch.device('cpu')
46
+
47
+
48
+ def atoi(text):
49
+ return int(text) if text.isdigit() else text
50
+
51
+
52
+ def natural_keys(text):
53
+ '''
54
+ alist.sort(key=natural_keys) sorts in human order
55
+ http://nedbatchelder.com/blog/200712/human_sorting.html
56
+ (See Toothy's implementation in the comments)
57
+ '''
58
+ return [atoi(c) for c in re.split(r'(\d+)', text)]
59
+
60
+
61
+ def get_model_state_file(checkpoint_dir, zero_stage):
62
+ if not os.path.isdir(checkpoint_dir):
63
+ raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
64
+
65
+ # there should be only one file
66
+ if zero_stage <= 2:
67
+ file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
68
+ elif zero_stage == 3:
69
+ file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
70
+
71
+ if not os.path.exists(file):
72
+ raise FileNotFoundError(f"can't find model states file at '{file}'")
73
+
74
+ return file
75
+
76
+
77
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
78
+ # XXX: need to test that this simple glob rule works for multi-node setup too
79
+ ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
80
+
81
+ if len(ckpt_files) == 0:
82
+ raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
83
+
84
+ return ckpt_files
85
+
86
+
87
+ def get_optim_files(checkpoint_dir):
88
+ return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
89
+
90
+
91
+ def get_model_state_files(checkpoint_dir):
92
+ return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
93
+
94
+
95
+ def parse_model_states(files):
96
+ zero_model_states = []
97
+ for file in files:
98
+ state_dict = torch.load(file, map_location=device)
99
+
100
+ if BUFFER_NAMES not in state_dict:
101
+ raise ValueError(f"{file} is not a model state checkpoint")
102
+ buffer_names = state_dict[BUFFER_NAMES]
103
+ if debug:
104
+ print("Found buffers:", buffer_names)
105
+
106
+ # recover just the buffers while restoring them to fp32 if they were saved in fp16
107
+ buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
108
+ param_shapes = state_dict[PARAM_SHAPES]
109
+
110
+ # collect parameters that are included in param_shapes
111
+ param_names = []
112
+ for s in param_shapes:
113
+ for name in s.keys():
114
+ param_names.append(name)
115
+
116
+ # update with frozen parameters
117
+ frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
118
+ if frozen_param_shapes is not None:
119
+ if debug:
120
+ print(f"Found frozen_param_shapes: {frozen_param_shapes}")
121
+ param_names += list(frozen_param_shapes.keys())
122
+
123
+ # handle shared params
124
+ shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
125
+
126
+ ds_version = state_dict.get(DS_VERSION, None)
127
+
128
+ frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
129
+
130
+ z_model_state = zero_model_state(buffers=buffers,
131
+ param_shapes=param_shapes,
132
+ shared_params=shared_params,
133
+ ds_version=ds_version,
134
+ frozen_param_shapes=frozen_param_shapes,
135
+ frozen_param_fragments=frozen_param_fragments)
136
+ zero_model_states.append(z_model_state)
137
+
138
+ return zero_model_states
139
+
140
+
141
+ def parse_optim_states(files, ds_checkpoint_dir):
142
+
143
+ total_files = len(files)
144
+ state_dicts = []
145
+ for f in files:
146
+ state_dict = torch.load(f, map_location=device)
147
+ # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
148
+ # and also handle the case where it was already removed by another helper script
149
+ state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
150
+ state_dicts.append(state_dict)
151
+
152
+ if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
153
+ raise ValueError(f"{files[0]} is not a zero checkpoint")
154
+ zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
155
+ world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
156
+
157
+ # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
158
+ # parameters can be different from data parallelism for non-expert parameters. So we can just
159
+ # use the max of the partition_count to get the dp world_size.
160
+
161
+ if type(world_size) is list:
162
+ world_size = max(world_size)
163
+
164
+ if world_size != total_files:
165
+ raise ValueError(
166
+ f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
167
+ "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
168
+ )
169
+
170
+ # the groups are named differently in each stage
171
+ if zero_stage <= 2:
172
+ fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
173
+ elif zero_stage == 3:
174
+ fp32_groups_key = FP32_FLAT_GROUPS
175
+ else:
176
+ raise ValueError(f"unknown zero stage {zero_stage}")
177
+
178
+ if zero_stage <= 2:
179
+ fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
180
+ elif zero_stage == 3:
181
+ # if there is more than one param group, there will be multiple flattened tensors - one
182
+ # flattened tensor per group - for simplicity merge them into a single tensor
183
+ #
184
+ # XXX: could make the script more memory efficient for when there are multiple groups - it
185
+ # will require matching the sub-lists of param_shapes for each param group flattened tensor
186
+
187
+ fp32_flat_groups = [
188
+ torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
189
+ ]
190
+
191
+ return zero_stage, world_size, fp32_flat_groups
192
+
193
+
194
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
195
+ """
196
+ Returns fp32 state_dict reconstructed from ds checkpoint
197
+
198
+ Args:
199
+ - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
200
+
201
+ """
202
+ print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
203
+
204
+ optim_files = get_optim_files(ds_checkpoint_dir)
205
+ zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
206
+ print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
207
+
208
+ model_files = get_model_state_files(ds_checkpoint_dir)
209
+
210
+ zero_model_states = parse_model_states(model_files)
211
+ print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
212
+
213
+ if zero_stage <= 2:
214
+ return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
215
+ exclude_frozen_parameters)
216
+ elif zero_stage == 3:
217
+ return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
218
+ exclude_frozen_parameters)
219
+
220
+
221
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
222
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
223
+ return
224
+
225
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
226
+ frozen_param_fragments = zero_model_states[0].frozen_param_fragments
227
+
228
+ if debug:
229
+ num_elem = sum(s.numel() for s in frozen_param_shapes.values())
230
+ print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
231
+
232
+ wanted_params = len(frozen_param_shapes)
233
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
234
+ avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
235
+ print(f'Frozen params: Have {avail_numel} numels to process.')
236
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
237
+
238
+ total_params = 0
239
+ total_numel = 0
240
+ for name, shape in frozen_param_shapes.items():
241
+ total_params += 1
242
+ unpartitioned_numel = shape.numel()
243
+ total_numel += unpartitioned_numel
244
+
245
+ state_dict[name] = frozen_param_fragments[name]
246
+
247
+ if debug:
248
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
249
+
250
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
251
+
252
+
253
+ def _has_callable(obj, fn):
254
+ attr = getattr(obj, fn, None)
255
+ return callable(attr)
256
+
257
+
258
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
259
+ param_shapes = zero_model_states[0].param_shapes
260
+
261
+ # Reconstruction protocol:
262
+ #
263
+ # XXX: document this
264
+
265
+ if debug:
266
+ for i in range(world_size):
267
+ for j in range(len(fp32_flat_groups[0])):
268
+ print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
269
+
270
+ # XXX: memory usage doubles here (zero2)
271
+ num_param_groups = len(fp32_flat_groups[0])
272
+ merged_single_partition_of_fp32_groups = []
273
+ for i in range(num_param_groups):
274
+ merged_partitions = [sd[i] for sd in fp32_flat_groups]
275
+ full_single_fp32_vector = torch.cat(merged_partitions, 0)
276
+ merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
277
+ avail_numel = sum(
278
+ [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
279
+
280
+ if debug:
281
+ wanted_params = sum([len(shapes) for shapes in param_shapes])
282
+ wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
283
+ # not asserting if there is a mismatch due to possible padding
284
+ print(f"Have {avail_numel} numels to process.")
285
+ print(f"Need {wanted_numel} numels in {wanted_params} params.")
286
+
287
+ # params
288
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
289
+ # out-of-core computing solution
290
+ total_numel = 0
291
+ total_params = 0
292
+ for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
293
+ offset = 0
294
+ avail_numel = full_single_fp32_vector.numel()
295
+ for name, shape in shapes.items():
296
+
297
+ unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
298
+ total_numel += unpartitioned_numel
299
+ total_params += 1
300
+
301
+ if debug:
302
+ print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
303
+ state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
304
+ offset += unpartitioned_numel
305
+
306
+ # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
307
+ # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
308
+ # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
309
+ # live optimizer object, so we are checking that the numbers are within the right range
310
+ align_to = 2 * world_size
311
+
312
+ def zero2_align(x):
313
+ return align_to * math.ceil(x / align_to)
314
+
315
+ if debug:
316
+ print(f"original offset={offset}, avail_numel={avail_numel}")
317
+
318
+ offset = zero2_align(offset)
319
+ avail_numel = zero2_align(avail_numel)
320
+
321
+ if debug:
322
+ print(f"aligned offset={offset}, avail_numel={avail_numel}")
323
+
324
+ # Sanity check
325
+ if offset != avail_numel:
326
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
327
+
328
+ print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
329
+
330
+
331
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
332
+ exclude_frozen_parameters):
333
+ state_dict = OrderedDict()
334
+
335
+ # buffers
336
+ buffers = zero_model_states[0].buffers
337
+ state_dict.update(buffers)
338
+ if debug:
339
+ print(f"added {len(buffers)} buffers")
340
+
341
+ if not exclude_frozen_parameters:
342
+ _zero2_merge_frozen_params(state_dict, zero_model_states)
343
+
344
+ _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
345
+
346
+ # recover shared parameters
347
+ for pair in zero_model_states[0].shared_params:
348
+ if pair[1] in state_dict:
349
+ state_dict[pair[0]] = state_dict[pair[1]]
350
+
351
+ return state_dict
352
+
353
+
354
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
355
+ remainder = unpartitioned_numel % world_size
356
+ padding_numel = (world_size - remainder) if remainder else 0
357
+ partitioned_numel = math.ceil(unpartitioned_numel / world_size)
358
+ return partitioned_numel, padding_numel
359
+
360
+
361
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
362
+ if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
363
+ return
364
+
365
+ if debug:
366
+ for i in range(world_size):
367
+ num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
368
+ print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
369
+
370
+ frozen_param_shapes = zero_model_states[0].frozen_param_shapes
371
+ wanted_params = len(frozen_param_shapes)
372
+ wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
373
+ avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
374
+ print(f'Frozen params: Have {avail_numel} numels to process.')
375
+ print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
376
+
377
+ total_params = 0
378
+ total_numel = 0
379
+ for name, shape in zero_model_states[0].frozen_param_shapes.items():
380
+ total_params += 1
381
+ unpartitioned_numel = shape.numel()
382
+ total_numel += unpartitioned_numel
383
+
384
+ param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
385
+ state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
386
+
387
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
388
+
389
+ if debug:
390
+ print(
391
+ f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
392
+ )
393
+
394
+ print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
395
+
396
+
397
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
398
+ param_shapes = zero_model_states[0].param_shapes
399
+ avail_numel = fp32_flat_groups[0].numel() * world_size
400
+ # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
401
+ # param, re-consolidating each param, while dealing with padding if any
402
+
403
+ # merge list of dicts, preserving order
404
+ param_shapes = {k: v for d in param_shapes for k, v in d.items()}
405
+
406
+ if debug:
407
+ for i in range(world_size):
408
+ print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
409
+
410
+ wanted_params = len(param_shapes)
411
+ wanted_numel = sum(shape.numel() for shape in param_shapes.values())
412
+ # not asserting if there is a mismatch due to possible padding
413
+ avail_numel = fp32_flat_groups[0].numel() * world_size
414
+ print(f"Trainable params: Have {avail_numel} numels to process.")
415
+ print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
416
+
417
+ # params
418
+ # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
419
+ # out-of-core computing solution
420
+ offset = 0
421
+ total_numel = 0
422
+ total_params = 0
423
+ for name, shape in param_shapes.items():
424
+
425
+ unpartitioned_numel = shape.numel()
426
+ total_numel += unpartitioned_numel
427
+ total_params += 1
428
+
429
+ partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
430
+
431
+ if debug:
432
+ print(
433
+ f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
434
+ )
435
+
436
+ # XXX: memory usage doubles here
437
+ state_dict[name] = torch.cat(
438
+ tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
439
+ 0).narrow(0, 0, unpartitioned_numel).view(shape)
440
+ offset += partitioned_numel
441
+
442
+ offset *= world_size
443
+
444
+ # Sanity check
445
+ if offset != avail_numel:
446
+ raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
447
+
448
+ print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
449
+
450
+
451
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
452
+ exclude_frozen_parameters):
453
+ state_dict = OrderedDict()
454
+
455
+ # buffers
456
+ buffers = zero_model_states[0].buffers
457
+ state_dict.update(buffers)
458
+ if debug:
459
+ print(f"added {len(buffers)} buffers")
460
+
461
+ if not exclude_frozen_parameters:
462
+ _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
463
+
464
+ _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
465
+
466
+ # recover shared parameters
467
+ for pair in zero_model_states[0].shared_params:
468
+ if pair[1] in state_dict:
469
+ state_dict[pair[0]] = state_dict[pair[1]]
470
+
471
+ return state_dict
472
+
473
+
474
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
475
+ """
476
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
477
+ ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
478
+ via a model hub.
479
+
480
+ Args:
481
+ - ``checkpoint_dir``: path to the desired checkpoint folder
482
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
483
+ - ``exclude_frozen_parameters``: exclude frozen parameters
484
+
485
+ Returns:
486
+ - pytorch ``state_dict``
487
+
488
+ Note: this approach may not work if your application doesn't have sufficient free CPU memory and
489
+ you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
490
+ the checkpoint.
491
+
492
+ A typical usage might be ::
493
+
494
+ from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
495
+ # do the training and checkpoint saving
496
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
497
+ model = model.cpu() # move to cpu
498
+ model.load_state_dict(state_dict)
499
+ # submit to model hub or save the model to share with others
500
+
501
+ In this example the ``model`` will no longer be usable in the deepspeed context of the same
502
+ application, i.e. you will need to re-initialize the deepspeed engine, since
503
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
504
+
505
+ If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
506
+
507
+ """
508
+ if tag is None:
509
+ latest_path = os.path.join(checkpoint_dir, 'latest')
510
+ if os.path.isfile(latest_path):
511
+ with open(latest_path, 'r') as fd:
512
+ tag = fd.read().strip()
513
+ else:
514
+ raise ValueError(f"Unable to find 'latest' file at {latest_path}")
515
+
516
+ ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
517
+
518
+ if not os.path.isdir(ds_checkpoint_dir):
519
+ raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
520
+
521
+ return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
522
+
523
+
524
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
525
+ """
526
+ Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
527
+ loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
528
+
529
+ Args:
530
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
531
+ - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
532
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
533
+ - ``exclude_frozen_parameters``: exclude frozen parameters
534
+ """
535
+
536
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
537
+ print(f"Saving fp32 state dict to {output_file}")
538
+ torch.save(state_dict, output_file)
539
+
540
+
541
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
542
+ """
543
+ 1. Put the provided model to cpu
544
+ 2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
545
+ 3. Load it into the provided model
546
+
547
+ Args:
548
+ - ``model``: the model object to update
549
+ - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
550
+ - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
551
+
552
+ Returns:
553
+ - ``model``: modified model
554
+
555
+ Make sure you have plenty of CPU memory available before you call this function. If you don't
556
+ have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
557
+ conveniently placed for you in the checkpoint folder.
558
+
559
+ A typical usage might be ::
560
+
561
+ from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
562
+ model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
563
+ # submit to model hub or save the model to share with others
564
+
565
+ Note that once this has been run, the ``model`` will no longer be usable in the deepspeed context
566
+ of the same application, i.e. you will need to re-initialize the deepspeed engine, since
567
+ ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
568
+
569
+ """
570
+ logger.info(f"Extracting fp32 weights")
571
+ state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
572
+
573
+ logger.info(f"Overwriting model with fp32 weights")
574
+ model = model.cpu()
575
+ model.load_state_dict(state_dict, strict=False)
576
+
577
+ return model
578
+
579
+
580
+ if __name__ == "__main__":
581
+
582
+ parser = argparse.ArgumentParser()
583
+ parser.add_argument("checkpoint_dir",
584
+ type=str,
585
+ help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
586
+ parser.add_argument(
587
+ "output_file",
588
+ type=str,
589
+ help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
590
+ parser.add_argument("-t",
591
+ "--tag",
592
+ type=str,
593
+ default=None,
594
+ help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
595
+ parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
596
+ parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
597
+ args = parser.parse_args()
598
+
599
+ debug = args.debug
600
+
601
+ convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
602
+ args.output_file,
603
+ tag=args.tag,
604
+ exclude_frozen_parameters=args.exclude_frozen_parameters)
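The script above is the stock DeepSpeed `zero_to_fp32.py` helper that gets copied into every checkpoint folder. For these checkpoints it can consolidate the per-rank ZeRO shards under the tag directory named by the `latest` file into a single fp32 state dict. A minimal usage sketch, assuming the shard folder (e.g. `global_step331`) is present alongside and deepspeed is installed; the output filename is illustrative:

```python
# Minimal sketch: run from inside a checkpoint folder such as
# 20_128_e5_3e-5/checkpoint-331/.
# Equivalent CLI: python zero_to_fp32.py . pytorch_model_fp32.bin
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

# Reads the tag from the 'latest' file and writes a consolidated fp32 state dict.
convert_zero_checkpoint_to_fp32_state_dict(".", "pytorch_model_fp32.bin")
```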
20_128_e5_3e-5/checkpoint-331/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: ibm-granite/granite-3.3-8b-base
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.15.2
20_128_e5_3e-5/checkpoint-331/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 256,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 128,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "up_proj",
28
+ "v_proj",
29
+ "down_proj",
30
+ "o_proj",
31
+ "q_proj",
32
+ "k_proj",
33
+ "gate_proj"
34
+ ],
35
+ "task_type": "CAUSAL_LM",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
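The adapter config above describes a rank-128 LoRA (alpha 256, dropout 0.05) applied to all attention and MLP projections of `ibm-granite/granite-3.3-8b-base`. A hedged sketch of loading this checkpoint's adapter with PEFT; the dtype and device placement are illustrative choices, not something recorded in this upload:

```python
# Minimal sketch (assumes transformers and peft are installed and the adapter
# files have been downloaded; bfloat16/device_map are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-base", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-base")
model = PeftModel.from_pretrained(base, "20_128_e5_3e-5/checkpoint-331")
# model = model.merge_and_unload()  # optionally bake the LoRA deltas into the base weights
```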
20_128_e5_3e-5/checkpoint-331/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5f2a117de932c760cd245f52b9562f504ac2fdf0e3d1e2d4076cafdd8c4b6b9
3
+ size 791751704
20_128_e5_3e-5/checkpoint-331/latest ADDED
@@ -0,0 +1 @@
1
+ global_step331
20_128_e5_3e-5/checkpoint-331/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
20_128_e5_3e-5/checkpoint-331/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3d7020498a225e1ea8d2023e83d9da8a704e7bdaa35751322ec35988f1846d07
3
+ size 15920
20_128_e5_3e-5/checkpoint-331/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:25b1bc82003a545ae4ee4b259fceb45aaccd6b24dd9d9064a2e128c3e81dc45d
3
+ size 15920
20_128_e5_3e-5/checkpoint-331/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee952c55554a691d9a50af4da15082edb4c3ef031f986a9cecf3e386caf7c251
3
+ size 15920