RayDu0010 committed on Jun 26, 2025

Commit

ede7412

verified

1 Parent(s): 9d219e3

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

12_128_e5_3e-5/checkpoint-1014/README.md +202 -0
12_128_e5_3e-5/checkpoint-1014/adapter_config.json +39 -0
12_128_e5_3e-5/checkpoint-1014/adapter_model.safetensors +3 -0
12_128_e5_3e-5/checkpoint-1014/latest +1 -0
12_128_e5_3e-5/checkpoint-1014/merges.txt +0 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_0.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_1.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_2.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_3.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_4.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_5.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_6.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/rng_state_7.pth +3 -0
12_128_e5_3e-5/checkpoint-1014/scheduler.pt +3 -0
12_128_e5_3e-5/checkpoint-1014/special_tokens_map.json +45 -0
12_128_e5_3e-5/checkpoint-1014/tokenizer.json +0 -0
12_128_e5_3e-5/checkpoint-1014/tokenizer_config.json +188 -0
12_128_e5_3e-5/checkpoint-1014/trainer_state.json +1448 -0
12_128_e5_3e-5/checkpoint-1014/training_args.bin +3 -0
12_128_e5_3e-5/checkpoint-1014/vocab.json +0 -0
12_128_e5_3e-5/checkpoint-1014/zero_to_fp32.py +604 -0
12_128_e5_3e-5/checkpoint-1352/README.md +202 -0
12_128_e5_3e-5/checkpoint-1352/adapter_config.json +39 -0
12_128_e5_3e-5/checkpoint-1352/adapter_model.safetensors +3 -0
12_128_e5_3e-5/checkpoint-1352/latest +1 -0
12_128_e5_3e-5/checkpoint-1352/merges.txt +0 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_0.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_1.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_2.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_3.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_4.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_5.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_6.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/rng_state_7.pth +3 -0
12_128_e5_3e-5/checkpoint-1352/scheduler.pt +3 -0
12_128_e5_3e-5/checkpoint-1352/special_tokens_map.json +45 -0
12_128_e5_3e-5/checkpoint-1352/tokenizer.json +0 -0
12_128_e5_3e-5/checkpoint-1352/tokenizer_config.json +188 -0
12_128_e5_3e-5/checkpoint-1352/trainer_state.json +1924 -0
12_128_e5_3e-5/checkpoint-1352/training_args.bin +3 -0
12_128_e5_3e-5/checkpoint-1352/vocab.json +0 -0
12_128_e5_3e-5/checkpoint-1352/zero_to_fp32.py +604 -0
12_128_e5_3e-5/checkpoint-1690/README.md +202 -0
12_128_e5_3e-5/checkpoint-1690/adapter_config.json +39 -0
12_128_e5_3e-5/checkpoint-1690/adapter_model.safetensors +3 -0
12_128_e5_3e-5/checkpoint-1690/latest +1 -0
12_128_e5_3e-5/checkpoint-1690/merges.txt +0 -0
12_128_e5_3e-5/checkpoint-1690/rng_state_0.pth +3 -0
12_128_e5_3e-5/checkpoint-1690/rng_state_1.pth +3 -0
12_128_e5_3e-5/checkpoint-1690/rng_state_2.pth +3 -0
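
The three checkpoint directories in this listing can be related to training epochs. The following is a minimal sketch, assuming that the ratio recorded in checkpoint-1014's trainer_state.json (`"epoch": 3.0` at `"global_step": 1014`) holds uniformly across the run. The helper name `epoch_for_checkpoint` is hypothetical and not part of the uploaded files.

```python
# Hypothetical helper: map a checkpoint's global step to an epoch count.
# Assumption: every epoch has the same number of optimizer steps, taken from
# checkpoint-1014's trainer_state.json ("epoch": 3.0, "global_step": 1014).
REFERENCE_STEP = 1014
REFERENCE_EPOCH = 3.0
STEPS_PER_EPOCH = REFERENCE_STEP / REFERENCE_EPOCH


def epoch_for_checkpoint(dirname: str) -> float:
    """Parse 'checkpoint-<step>' and convert the step to epochs."""
    step = int(dirname.rsplit("-", 1)[1])
    return step / STEPS_PER_EPOCH


for name in ("checkpoint-1014", "checkpoint-1352", "checkpoint-1690"):
    print(name, "-> epoch", epoch_for_checkpoint(name))
```

Under that assumption, the checkpoints would correspond to the ends of the third, fourth, and fifth epochs, which is consistent with the `e5` in the directory name if it denotes five epochs.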

12_128_e5_3e-5/checkpoint-1014/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: ibm-granite/granite-3.3-8b-base
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.15.2

12_128_e5_3e-5/checkpoint-1014/adapter_config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+  "bias": "none",
+  "corda_config": null,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 256,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 128,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "up_proj",
+    "k_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "v_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_rslora": false
+}
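
The `lora_alpha`, `r`, and `use_rslora` fields above together determine how strongly the low-rank update is scaled. The sketch below recomputes that factor from a subset of the config, assuming the usual LoRA convention (`lora_alpha / r`, or `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled). The function name `lora_scaling` is illustrative and is not a PEFT API.

```python
import json
import math

# Subset of the adapter_config.json shown above (values copied from the diff).
ADAPTER_CONFIG = json.loads("""
{
  "lora_alpha": 256,
  "r": 128,
  "use_rslora": false,
  "lora_dropout": 0.05,
  "target_modules": ["o_proj", "up_proj", "k_proj", "gate_proj",
                     "q_proj", "down_proj", "v_proj"]
}
""")


def lora_scaling(config: dict) -> float:
    """Return the scale applied to the low-rank update B @ A.

    Standard LoRA uses lora_alpha / r; rank-stabilized LoRA (use_rslora)
    uses lora_alpha / sqrt(r). Illustrative helper, not a library call.
    """
    r = config["r"]
    if config.get("use_rslora", False):
        return config["lora_alpha"] / math.sqrt(r)
    return config["lora_alpha"] / r


print("scaling:", lora_scaling(ADAPTER_CONFIG))
print("adapted modules:", sorted(ADAPTER_CONFIG["target_modules"]))
```

The seven entries in `target_modules` cover both the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the MLP projections (`gate_proj`, `up_proj`, `down_proj`).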

12_128_e5_3e-5/checkpoint-1014/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cdd88b3df8433b88cf36eaff71fd2891b0b1b8810d7a5830bded49b09903da36
+size 791751704
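
The three added lines are a Git LFS pointer rather than the weights themselves: they record the spec version, a SHA-256 object ID, and the size in bytes of the real file. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of `huggingface_hub` or Git LFS.

```python
# Pointer text copied from the adapter_model.safetensors diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cdd88b3df8433b88cf36eaff71fd2891b0b1b8810d7a5830bded49b09903da36
size 791751704
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its version, hash, and size fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["digest"][:12])
print(f"{pointer['size_bytes'] / 2**20:.1f} MiB")
```

The same parsing applies to the `rng_state_*.pth` and `scheduler.pt` pointers later in this diff, which differ only in their digests and sizes.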

12_128_e5_3e-5/checkpoint-1014/latest ADDED

	@@ -0,0 +1 @@


+global_step1014

12_128_e5_3e-5/checkpoint-1014/merges.txt ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1014/rng_state_0.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:676555e16b6b7fc053e04e7d88f5d59b447f5157f47b9aea6afe5f5bb18ffd98
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_1.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:27923eeb4197aeccd9536abae743a52249f2b64e0b43c509de6c82b7f26601c4
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_2.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:336a4e0f5e03072bed1b6b26af746c014a7762b37109460c92cdad1e4a6f8498
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_3.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4cb50b9b268010b036c061c24c1c9e937688f7b23e7a5298c878e70ef99fa3d2
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_4.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b1575336999a9dea8ce868bfef454147f71e4efebb5182557704b2f031867a32
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_5.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:990d542c75dcf8ef2fe4ffc5032b5673eece89ee9989392ed0afb24303bbc167
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_6.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2046dfb04556ad90e03b45e16771ebec44b94c158ad37ddd915eb5559752ab9b
+size 15920

12_128_e5_3e-5/checkpoint-1014/rng_state_7.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7886c5ef39a875d0e2d0faae2ec7ef3f2861ea1ac14c5fc5141fe67e84ff2427
+size 15920

12_128_e5_3e-5/checkpoint-1014/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ef6bd2211324348e41b83ddbcce0bc361fdba046fc162226c3257fdfd5819165
+size 1064

12_128_e5_3e-5/checkpoint-1014/special_tokens_map.json ADDED

	@@ -0,0 +1,45 @@

+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<reponame>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

12_128_e5_3e-5/checkpoint-1014/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1014/tokenizer_config.json ADDED

	@@ -0,0 +1,188 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<fim_prefix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<fim_middle>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<fim_suffix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<fim_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<commit_before>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<commit_msg>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "17": {
+      "content": "<commit_after>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "18": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<reponame>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
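
One detail worth noting in this config is that the padding token is `<reponame>` rather than the end-of-text token, and that padding is applied on the left. The sketch below resolves the relevant token IDs from a reduced copy of `added_tokens_decoder` (only the two entries needed are reproduced); `token_id` is a hypothetical helper, not a `transformers` function.

```python
import json

# Reduced copy of tokenizer_config.json: only the fields and the two
# added_tokens_decoder entries needed for this check are reproduced.
TOKENIZER_CONFIG = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "<|endoftext|>", "special": true},
    "18": {"content": "<reponame>", "special": true}
  },
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<reponame>",
  "padding_side": "left"
}
""")


def token_id(config: dict, content: str) -> int:
    """Look up the integer ID of an added token by its string content."""
    for index, entry in config["added_tokens_decoder"].items():
        if entry["content"] == content:
            return int(index)
    raise KeyError(content)


pad_id = token_id(TOKENIZER_CONFIG, TOKENIZER_CONFIG["pad_token"])
eos_id = token_id(TOKENIZER_CONFIG, TOKENIZER_CONFIG["eos_token"])
print("pad id:", pad_id, "eos id:", eos_id)
print("pad differs from eos:", pad_id != eos_id)
```

Keeping the pad ID distinct from the EOS ID means that masking out padding positions does not also mask genuine end-of-text tokens, which may be why the config is set up this way.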

12_128_e5_3e-5/checkpoint-1014/trainer_state.json ADDED

	@@ -0,0 +1,1448 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 3.0,
+  "eval_steps": 500,
+  "global_step": 1014,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.014814814814814815,
+      "grad_norm": 1.205528974533081,
+      "learning_rate": 1.411764705882353e-06,
+      "loss": 1.3542,
+      "step": 5
+    },
+    {
+      "epoch": 0.02962962962962963,
+      "grad_norm": 0.9048640131950378,
+      "learning_rate": 3.1764705882352943e-06,
+      "loss": 1.2984,
+      "step": 10
+    },
+    {
+      "epoch": 0.044444444444444446,
+      "grad_norm": 0.7668255567550659,
+      "learning_rate": 4.941176470588235e-06,
+      "loss": 1.3011,
+      "step": 15
+    },
+    {
+      "epoch": 0.05925925925925926,
+      "grad_norm": 0.6333377957344055,
+      "learning_rate": 6.705882352941177e-06,
+      "loss": 1.2844,
+      "step": 20
+    },
+    {
+      "epoch": 0.07407407407407407,
+      "grad_norm": 0.7129969000816345,
+      "learning_rate": 8.470588235294118e-06,
+      "loss": 1.2659,
+      "step": 25
+    },
+    {
+      "epoch": 0.08888888888888889,
+      "grad_norm": 1.9554294347763062,
+      "learning_rate": 1.023529411764706e-05,
+      "loss": 1.267,
+      "step": 30
+    },
+    {
+      "epoch": 0.1037037037037037,
+      "grad_norm": 0.7532616853713989,
+      "learning_rate": 1.2e-05,
+      "loss": 1.1683,
+      "step": 35
+    },
+    {
+      "epoch": 0.11851851851851852,
+      "grad_norm": 0.5177083611488342,
+      "learning_rate": 1.3764705882352941e-05,
+      "loss": 1.2087,
+      "step": 40
+    },
+    {
+      "epoch": 0.13333333333333333,
+      "grad_norm": 0.48525315523147583,
+      "learning_rate": 1.5529411764705886e-05,
+      "loss": 1.2043,
+      "step": 45
+    },
+    {
+      "epoch": 0.14814814814814814,
+      "grad_norm": 0.4408872723579407,
+      "learning_rate": 1.7294117647058823e-05,
+      "loss": 1.1894,
+      "step": 50
+    },
+    {
+      "epoch": 0.16296296296296298,
+      "grad_norm": 0.684490442276001,
+      "learning_rate": 1.9058823529411764e-05,
+      "loss": 1.183,
+      "step": 55
+    },
+    {
+      "epoch": 0.17777777777777778,
+      "grad_norm": 0.4762347936630249,
+      "learning_rate": 2.0823529411764705e-05,
+      "loss": 1.192,
+      "step": 60
+    },
+    {
+      "epoch": 0.1925925925925926,
+      "grad_norm": 0.5843256115913391,
+      "learning_rate": 2.2588235294117646e-05,
+      "loss": 1.1925,
+      "step": 65
+    },
+    {
+      "epoch": 0.2074074074074074,
+      "grad_norm": 0.46570947766304016,
+      "learning_rate": 2.4352941176470587e-05,
+      "loss": 1.1338,
+      "step": 70
+    },
+    {
+      "epoch": 0.2222222222222222,
+      "grad_norm": 0.5429889559745789,
+      "learning_rate": 2.6117647058823532e-05,
+      "loss": 1.1232,
+      "step": 75
+    },
+    {
+      "epoch": 0.23703703703703705,
+      "grad_norm": 0.5680948495864868,
+      "learning_rate": 2.7882352941176473e-05,
+      "loss": 1.1593,
+      "step": 80
+    },
+    {
+      "epoch": 0.2518518518518518,
+      "grad_norm": 0.46284499764442444,
+      "learning_rate": 2.9647058823529414e-05,
+      "loss": 1.0776,
+      "step": 85
+    },
+    {
+      "epoch": 0.26666666666666666,
+      "grad_norm": 0.5861993432044983,
+      "learning_rate": 2.9999540242630432e-05,
+      "loss": 1.1084,
+      "step": 90
+    },
+    {
+      "epoch": 0.2814814814814815,
+      "grad_norm": 0.5396347641944885,
+      "learning_rate": 2.9997672526619356e-05,
+      "loss": 1.1303,
+      "step": 95
+    },
+    {
+      "epoch": 0.2962962962962963,
+      "grad_norm": 0.5758911967277527,
+      "learning_rate": 2.999436829588809e-05,
+      "loss": 1.1098,
+      "step": 100
+    },
+    {
+      "epoch": 0.3111111111111111,
+      "grad_norm": 0.547749936580658,
+      "learning_rate": 2.9989627866924146e-05,
+      "loss": 1.0519,
+      "step": 105
+    },
+    {
+      "epoch": 0.32592592592592595,
+      "grad_norm": 0.5769826769828796,
+      "learning_rate": 2.9983451693777715e-05,
+      "loss": 1.0643,
+      "step": 110
+    },
+    {
+      "epoch": 0.34074074074074073,
+      "grad_norm": 0.6840924620628357,
+      "learning_rate": 2.9975840368018158e-05,
+      "loss": 1.0418,
+      "step": 115
+    },
+    {
+      "epoch": 0.35555555555555557,
+      "grad_norm": 0.874555230140686,
+      "learning_rate": 2.9966794618677357e-05,
+      "loss": 1.0394,
+      "step": 120
+    },
+    {
+      "epoch": 0.37037037037037035,
+      "grad_norm": 0.6433928608894348,
+      "learning_rate": 2.99563153121799e-05,
+      "loss": 1.0176,
+      "step": 125
+    },
+    {
+      "epoch": 0.3851851851851852,
+      "grad_norm": 0.6158267855644226,
+      "learning_rate": 2.9944403452260055e-05,
+      "loss": 0.9896,
+      "step": 130
+    },
+    {
+      "epoch": 0.4,
+      "grad_norm": 0.7418085336685181,
+      "learning_rate": 2.9931060179865677e-05,
+      "loss": 1.0187,
+      "step": 135
+    },
+    {
+      "epoch": 0.4148148148148148,
+      "grad_norm": 0.625586986541748,
+      "learning_rate": 2.991628677304888e-05,
+      "loss": 0.9369,
+      "step": 140
+    },
+    {
+      "epoch": 0.42962962962962964,
+      "grad_norm": 0.7202538847923279,
+      "learning_rate": 2.990008464684366e-05,
+      "loss": 0.9627,
+      "step": 145
+    },
+    {
+      "epoch": 0.4444444444444444,
+      "grad_norm": 0.666614830493927,
+      "learning_rate": 2.9882455353130327e-05,
+      "loss": 0.9299,
+      "step": 150
+    },
+    {
+      "epoch": 0.45925925925925926,
+      "grad_norm": 0.6723878979682922,
+      "learning_rate": 2.9863400580486884e-05,
+      "loss": 0.9275,
+      "step": 155
+    },
+    {
+      "epoch": 0.4740740740740741,
+      "grad_norm": 0.6323536038398743,
+      "learning_rate": 2.984292215402729e-05,
+      "loss": 0.8664,
+      "step": 160
+    },
+    {
+      "epoch": 0.4888888888888889,
+      "grad_norm": 0.837838888168335,
+      "learning_rate": 2.982102203522663e-05,
+      "loss": 0.8963,
+      "step": 165
+    },
+    {
+      "epoch": 0.5037037037037037,
+      "grad_norm": 0.7143777012825012,
+      "learning_rate": 2.9797702321733254e-05,
+      "loss": 0.8955,
+      "step": 170
+    },
+    {
+      "epoch": 0.5185185185185185,
+      "grad_norm": 0.7509482502937317,
+      "learning_rate": 2.9772965247167855e-05,
+      "loss": 0.9471,
+      "step": 175
+    },
+    {
+      "epoch": 0.5333333333333333,
+      "grad_norm": 0.7797493934631348,
+      "learning_rate": 2.974681318090953e-05,
+      "loss": 0.8739,
+      "step": 180
+    },
+    {
+      "epoch": 0.5481481481481482,
+      "grad_norm": 0.845409631729126,
+      "learning_rate": 2.9719248627868823e-05,
+      "loss": 0.8954,
+      "step": 185
+    },
+    {
+      "epoch": 0.562962962962963,
+      "grad_norm": 0.8383833765983582,
+      "learning_rate": 2.9690274228247825e-05,
+      "loss": 0.919,
+      "step": 190
+    },
+    {
+      "epoch": 0.5777777777777777,
+      "grad_norm": 0.8013415932655334,
+      "learning_rate": 2.9659892757287247e-05,
+      "loss": 0.8092,
+      "step": 195
+    },
+    {
+      "epoch": 0.5925925925925926,
+      "grad_norm": 0.7464110255241394,
+      "learning_rate": 2.9628107125000648e-05,
+      "loss": 0.8591,
+      "step": 200
+    },
+    {
+      "epoch": 0.6074074074074074,
+      "grad_norm": 0.9086791276931763,
+      "learning_rate": 2.959492037589567e-05,
+      "loss": 0.8159,
+      "step": 205
+    },
+    {
+      "epoch": 0.6222222222222222,
+      "grad_norm": 0.7836911678314209,
+      "learning_rate": 2.9560335688682443e-05,
+      "loss": 0.8523,
+      "step": 210
+    },
+    {
+      "epoch": 0.6370370370370371,
+      "grad_norm": 0.8023139834403992,
+      "learning_rate": 2.952435637596912e-05,
+      "loss": 0.8181,
+      "step": 215
+    },
+    {
+      "epoch": 0.6518518518518519,
+      "grad_norm": 0.9165554046630859,
+      "learning_rate": 2.9486985883944586e-05,
+      "loss": 0.8079,
+      "step": 220
+    },
+    {
+      "epoch": 0.6666666666666666,
+      "grad_norm": 0.8642603158950806,
+      "learning_rate": 2.944822779204837e-05,
+      "loss": 0.7844,
+      "step": 225
+    },
+    {
+      "epoch": 0.6814814814814815,
+      "grad_norm": 0.8418120741844177,
+      "learning_rate": 2.9408085812627797e-05,
+      "loss": 0.754,
+      "step": 230
+    },
+    {
+      "epoch": 0.6962962962962963,
+      "grad_norm": 0.8577683568000793,
+      "learning_rate": 2.9366563790582416e-05,
+      "loss": 0.8121,
+      "step": 235
+    },
+    {
+      "epoch": 0.7111111111111111,
+      "grad_norm": 0.8663591146469116,
+      "learning_rate": 2.932366570299573e-05,
+      "loss": 0.7656,
+      "step": 240
+    },
+    {
+      "epoch": 0.725925925925926,
+      "grad_norm": 0.8809316158294678,
+      "learning_rate": 2.927939565875424e-05,
+      "loss": 0.7573,
+      "step": 245
+    },
+    {
+      "epoch": 0.7407407407407407,
+      "grad_norm": 0.9155621528625488,
+      "learning_rate": 2.9233757898153907e-05,
+      "loss": 0.7946,
+      "step": 250
+    },
+    {
+      "epoch": 0.7555555555555555,
+      "grad_norm": 0.9761556386947632,
+      "learning_rate": 2.9186756792493996e-05,
+      "loss": 0.7504,
+      "step": 255
+    },
+    {
+      "epoch": 0.7703703703703704,
+      "grad_norm": 0.8892177939414978,
+      "learning_rate": 2.9138396843658383e-05,
+      "loss": 0.7275,
+      "step": 260
+    },
+    {
+      "epoch": 0.7851851851851852,
+      "grad_norm": 1.0526090860366821,
+      "learning_rate": 2.9088682683684363e-05,
+      "loss": 0.7361,
+      "step": 265
+    },
+    {
+      "epoch": 0.8,
+      "grad_norm": 0.9149707555770874,
+      "learning_rate": 2.9037619074318955e-05,
+      "loss": 0.6894,
+      "step": 270
+    },
+    {
+      "epoch": 0.8148148148148148,
+      "grad_norm": 0.9914748668670654,
+      "learning_rate": 2.8985210906562845e-05,
+      "loss": 0.6885,
+      "step": 275
+    },
+    {
+      "epoch": 0.8296296296296296,
+      "grad_norm": 1.0852508544921875,
+      "learning_rate": 2.8931463200201893e-05,
+      "loss": 0.7472,
+      "step": 280
+    },
+    {
+      "epoch": 0.8444444444444444,
+      "grad_norm": 0.8442374467849731,
+      "learning_rate": 2.8876381103326315e-05,
+      "loss": 0.7197,
+      "step": 285
+    },
+    {
+      "epoch": 0.8592592592592593,
+      "grad_norm": 0.9241513609886169,
+      "learning_rate": 2.881996989183762e-05,
+      "loss": 0.6262,
+      "step": 290
+    },
+    {
+      "epoch": 0.8740740740740741,
+      "grad_norm": 0.9735771417617798,
+      "learning_rate": 2.8762234968943242e-05,
+      "loss": 0.6872,
+      "step": 295
+    },
+    {
+      "epoch": 0.8888888888888888,
+      "grad_norm": 1.1165865659713745,
+      "learning_rate": 2.8703181864639013e-05,
+      "loss": 0.6681,
+      "step": 300
+    },
+    {
+      "epoch": 0.9037037037037037,
+      "grad_norm": 1.1977198123931885,
+      "learning_rate": 2.8642816235179497e-05,
+      "loss": 0.7009,
+      "step": 305
+    },
+    {
+      "epoch": 0.9185185185185185,
+      "grad_norm": 1.0150716304779053,
+      "learning_rate": 2.8581143862536195e-05,
+      "loss": 0.6847,
+      "step": 310
+    },
+    {
+      "epoch": 0.9333333333333333,
+      "grad_norm": 0.9897181987762451,
+      "learning_rate": 2.8518170653843775e-05,
+      "loss": 0.6415,
+      "step": 315
+    },
+    {
+      "epoch": 0.9481481481481482,
+      "grad_norm": 0.9338468313217163,
+      "learning_rate": 2.8453902640834232e-05,
+      "loss": 0.6915,
+      "step": 320
+    },
+    {
+      "epoch": 0.9629629629629629,
+      "grad_norm": 0.9977596402168274,
+      "learning_rate": 2.8388345979259168e-05,
+      "loss": 0.6448,
+      "step": 325
+    },
+    {
+      "epoch": 0.9777777777777777,
+      "grad_norm": 1.015538215637207,
+      "learning_rate": 2.8321506948300177e-05,
+      "loss": 0.6219,
+      "step": 330
+    },
+    {
+      "epoch": 0.9925925925925926,
+      "grad_norm": 1.038949966430664,
+      "learning_rate": 2.825339194996743e-05,
+      "loss": 0.631,
+      "step": 335
+    },
+    {
+      "epoch": 1.005925925925926,
+      "grad_norm": 0.8998421430587769,
+      "learning_rate": 2.8184007508486434e-05,
+      "loss": 0.5823,
+      "step": 340
+    },
+    {
+      "epoch": 1.0207407407407407,
+      "grad_norm": 1.0664684772491455,
+      "learning_rate": 2.8113360269673154e-05,
+      "loss": 0.5729,
+      "step": 345
+    },
+    {
+      "epoch": 1.0355555555555556,
+      "grad_norm": 1.047934889793396,
+      "learning_rate": 2.8041457000297456e-05,
+      "loss": 0.5202,
+      "step": 350
+    },
+    {
+      "epoch": 1.0503703703703704,
+      "grad_norm": 0.9848654866218567,
+      "learning_rate": 2.7968304587434973e-05,
+      "loss": 0.5329,
+      "step": 355
+    },
+    {
+      "epoch": 1.0651851851851852,
+      "grad_norm": 1.0740442276000977,
+      "learning_rate": 2.7893910037807415e-05,
+      "loss": 0.566,
+      "step": 360
+    },
+    {
+      "epoch": 1.08,
+      "grad_norm": 1.0978713035583496,
+      "learning_rate": 2.781828047711149e-05,
+      "loss": 0.5689,
+      "step": 365
+    },
+    {
+      "epoch": 1.094814814814815,
+      "grad_norm": 1.1654003858566284,
+      "learning_rate": 2.774142314933636e-05,
+      "loss": 0.543,
+      "step": 370
+    },
+    {
+      "epoch": 1.1096296296296297,
+      "grad_norm": 1.1164604425430298,
+      "learning_rate": 2.76633454160698e-05,
+      "loss": 0.4947,
+      "step": 375
+    },
+    {
+      "epoch": 1.1244444444444444,
+      "grad_norm": 1.0386937856674194,
+      "learning_rate": 2.758405475579308e-05,
+      "loss": 0.4964,
+      "step": 380
+    },
+    {
+      "epoch": 1.1392592592592592,
+      "grad_norm": 1.074188232421875,
+      "learning_rate": 2.750355876316467e-05,
+      "loss": 0.5535,
+      "step": 385
+    },
+    {
+      "epoch": 1.154074074074074,
+      "grad_norm": 1.0344929695129395,
+      "learning_rate": 2.7421865148292796e-05,
+      "loss": 0.5269,
+      "step": 390
+    },
+    {
+      "epoch": 1.1688888888888889,
+      "grad_norm": 0.9018165469169617,
+      "learning_rate": 2.733898173599695e-05,
+      "loss": 0.5389,
+      "step": 395
+    },
+    {
+      "epoch": 1.1837037037037037,
+      "grad_norm": 1.1409497261047363,
+      "learning_rate": 2.7254916465058408e-05,
+      "loss": 0.4876,
+      "step": 400
+    },
+    {
+      "epoch": 1.1985185185185185,
+      "grad_norm": 1.1169958114624023,
+      "learning_rate": 2.7169677387459835e-05,
+      "loss": 0.4854,
+      "step": 405
+    },
+    {
+      "epoch": 1.2133333333333334,
+      "grad_norm": 0.9902099967002869,
+      "learning_rate": 2.7083272667614034e-05,
+      "loss": 0.4844,
+      "step": 410
+    },
+    {
+      "epoch": 1.2281481481481482,
+      "grad_norm": 1.0552476644515991,
+      "learning_rate": 2.699571058158196e-05,
+      "loss": 0.516,
+      "step": 415
+    },
+    {
+      "epoch": 1.242962962962963,
+      "grad_norm": 1.1118648052215576,
+      "learning_rate": 2.6906999516280004e-05,
+      "loss": 0.4889,
+      "step": 420
+    },
+    {
+      "epoch": 1.2577777777777777,
+      "grad_norm": 1.1397099494934082,
+      "learning_rate": 2.681714796867667e-05,
+      "loss": 0.49,
+      "step": 425
+    },
+    {
+      "epoch": 1.2725925925925927,
+      "grad_norm": 1.164249300956726,
+      "learning_rate": 2.672616454497873e-05,
+      "loss": 0.4699,
+      "step": 430
+    },
+    {
+      "epoch": 1.2874074074074073,
+      "grad_norm": 1.187857747077942,
+      "learning_rate": 2.6634057959806872e-05,
+      "loss": 0.4833,
+      "step": 435
+    },
+    {
+      "epoch": 1.3022222222222222,
+      "grad_norm": 1.3279324769973755,
+      "learning_rate": 2.6540837035361033e-05,
+      "loss": 0.4913,
+      "step": 440
+    },
+    {
+      "epoch": 1.317037037037037,
+      "grad_norm": 1.1399012804031372,
+      "learning_rate": 2.6446510700575342e-05,
+      "loss": 0.4803,
+      "step": 445
+    },
+    {
+      "epoch": 1.3318518518518518,
+      "grad_norm": 1.2169464826583862,
+      "learning_rate": 2.6351087990262912e-05,
+      "loss": 0.4724,
+      "step": 450
+    },
+    {
+      "epoch": 1.3466666666666667,
+      "grad_norm": 1.213593602180481,
+      "learning_rate": 2.625457804425046e-05,
+      "loss": 0.4559,
+      "step": 455
+    },
+    {
+      "epoch": 1.3614814814814815,
+      "grad_norm": 1.121523141860962,
+      "learning_rate": 2.6156990106502863e-05,
+      "loss": 0.4625,
+      "step": 460
+    },
+    {
+      "epoch": 1.3762962962962964,
+      "grad_norm": 1.389193058013916,
+      "learning_rate": 2.6058333524237755e-05,
+      "loss": 0.4249,
+      "step": 465
+    },
+    {
+      "epoch": 1.3911111111111112,
+      "grad_norm": 1.2195489406585693,
+      "learning_rate": 2.595861774703022e-05,
+      "loss": 0.4754,
+      "step": 470
+    },
+    {
+      "epoch": 1.405925925925926,
+      "grad_norm": 1.196704626083374,
+      "learning_rate": 2.58578523259077e-05,
+      "loss": 0.4572,
+      "step": 475
+    },
+    {
+      "epoch": 1.4207407407407406,
+      "grad_norm": 1.095668911933899,
+      "learning_rate": 2.5756046912435158e-05,
+      "loss": 0.4805,
+      "step": 480
+    },
+    {
+      "epoch": 1.4355555555555555,
+      "grad_norm": 1.2372747659683228,
+      "learning_rate": 2.5653211257790636e-05,
+      "loss": 0.4113,
+      "step": 485
+    },
+    {
+      "epoch": 1.4503703703703703,
+      "grad_norm": 1.2018955945968628,
+      "learning_rate": 2.5549355211831265e-05,
+      "loss": 0.5064,
+      "step": 490
+    },
+    {
+      "epoch": 1.4651851851851851,
+      "grad_norm": 1.1164575815200806,
+      "learning_rate": 2.5444488722149812e-05,
+      "loss": 0.4418,
+      "step": 495
+    },
+    {
+      "epoch": 1.48,
+      "grad_norm": 1.0810885429382324,
+      "learning_rate": 2.533862183312189e-05,
+      "loss": 0.4304,
+      "step": 500
+    },
+    {
+      "epoch": 1.4948148148148148,
+      "grad_norm": 1.0699983835220337,
+      "learning_rate": 2.5231764684943865e-05,
+      "loss": 0.395,
+      "step": 505
+    },
+    {
+      "epoch": 1.5096296296296297,
+      "grad_norm": 1.0887062549591064,
+      "learning_rate": 2.5123927512661605e-05,
+      "loss": 0.4078,
+      "step": 510
+    },
+    {
+      "epoch": 1.5244444444444445,
+      "grad_norm": 1.0435563325881958,
+      "learning_rate": 2.5015120645190158e-05,
+      "loss": 0.4214,
+      "step": 515
+    },
+    {
+      "epoch": 1.5392592592592593,
+      "grad_norm": 0.9908123016357422,
+      "learning_rate": 2.4905354504324404e-05,
+      "loss": 0.4122,
+      "step": 520
+    },
+    {
+      "epoch": 1.554074074074074,
+      "grad_norm": 1.1591845750808716,
+      "learning_rate": 2.4794639603740844e-05,
+      "loss": 0.3957,
+      "step": 525
+    },
+    {
+      "epoch": 1.568888888888889,
+      "grad_norm": 1.1682777404785156,
+      "learning_rate": 2.4682986547990553e-05,
+      "loss": 0.4238,
+      "step": 530
+    },
+    {
+      "epoch": 1.5837037037037036,
+      "grad_norm": 1.0811845064163208,
+      "learning_rate": 2.4570406031483474e-05,
+      "loss": 0.408,
+      "step": 535
+    },
+    {
+      "epoch": 1.5985185185185187,
+      "grad_norm": 0.9767659306526184,
+      "learning_rate": 2.445690883746407e-05,
+      "loss": 0.3869,
+      "step": 540
+    },
+    {
+      "epoch": 1.6133333333333333,
+      "grad_norm": 1.1874679327011108,
+      "learning_rate": 2.4342505836978463e-05,
+      "loss": 0.4176,
+      "step": 545
+    },
+    {
+      "epoch": 1.6281481481481481,
+      "grad_norm": 1.176788330078125,
+      "learning_rate": 2.422720798783321e-05,
+      "loss": 0.3843,
+      "step": 550
+    },
+    {
+      "epoch": 1.642962962962963,
+      "grad_norm": 1.0802936553955078,
+      "learning_rate": 2.411102633354571e-05,
+      "loss": 0.4016,
+      "step": 555
+    },
+    {
+      "epoch": 1.6577777777777778,
+      "grad_norm": 1.449876308441162,
+      "learning_rate": 2.3993972002286434e-05,
+      "loss": 0.4329,
+      "step": 560
+    },
+    {
+      "epoch": 1.6725925925925926,
+      "grad_norm": 1.415281891822815,
+      "learning_rate": 2.387605620581305e-05,
+      "loss": 0.4036,
+      "step": 565
+    },
+    {
+      "epoch": 1.6874074074074072,
+      "grad_norm": 1.1891310214996338,
+      "learning_rate": 2.3757290238396528e-05,
+      "loss": 0.4104,
+      "step": 570
+    },
+    {
+      "epoch": 1.7022222222222223,
+      "grad_norm": 1.1044912338256836,
+      "learning_rate": 2.3637685475739332e-05,
+      "loss": 0.4061,
+      "step": 575
+    },
+    {
+      "epoch": 1.717037037037037,
+      "grad_norm": 1.0706647634506226,
+      "learning_rate": 2.351725337388586e-05,
+      "loss": 0.3315,
+      "step": 580
+    },
+    {
+      "epoch": 1.731851851851852,
+      "grad_norm": 1.2027537822723389,
+      "learning_rate": 2.3396005468125116e-05,
+      "loss": 0.3624,
+      "step": 585
+    },
+    {
+      "epoch": 1.7466666666666666,
+      "grad_norm": 1.213278889656067,
+      "learning_rate": 2.327395337188585e-05,
+      "loss": 0.3812,
+      "step": 590
+    },
+    {
+      "epoch": 1.7614814814814816,
+      "grad_norm": 1.028390645980835,
+      "learning_rate": 2.3151108775624222e-05,
+      "loss": 0.3587,
+      "step": 595
+    },
+    {
+      "epoch": 1.7762962962962963,
+      "grad_norm": 1.1282511949539185,
+      "learning_rate": 2.3027483445704e-05,
+      "loss": 0.3558,
+      "step": 600
+    },
+    {
+      "epoch": 1.791111111111111,
+      "grad_norm": 1.2234516143798828,
+      "learning_rate": 2.2903089223269595e-05,
+      "loss": 0.3796,
+      "step": 605
+    },
+    {
+      "epoch": 1.805925925925926,
+      "grad_norm": 1.3621071577072144,
+      "learning_rate": 2.277793802311188e-05,
+      "loss": 0.3756,
+      "step": 610
+    },
+    {
+      "epoch": 1.8207407407407408,
+      "grad_norm": 1.1892695426940918,
+      "learning_rate": 2.265204183252694e-05,
+      "loss": 0.3773,
+      "step": 615
+    },
+    {
+      "epoch": 1.8355555555555556,
+      "grad_norm": 0.9955325126647949,
+      "learning_rate": 2.2525412710167933e-05,
+      "loss": 0.3434,
+      "step": 620
+    },
+    {
+      "epoch": 1.8503703703703702,
+      "grad_norm": 1.1191469430923462,
+      "learning_rate": 2.239806278489003e-05,
+      "loss": 0.3555,
+      "step": 625
+    },
+    {
+      "epoch": 1.8651851851851853,
+      "grad_norm": 1.1457566022872925,
+      "learning_rate": 2.2270004254588752e-05,
+      "loss": 0.3586,
+      "step": 630
+    },
+    {
+      "epoch": 1.88,
+      "grad_norm": 1.285786747932434,
+      "learning_rate": 2.2141249385031564e-05,
+      "loss": 0.3506,
+      "step": 635
+    },
+    {
+      "epoch": 1.894814814814815,
+      "grad_norm": 1.1033052206039429,
+      "learning_rate": 2.2011810508683057e-05,
+      "loss": 0.3766,
+      "step": 640
+    },
+    {
+      "epoch": 1.9096296296296296,
+      "grad_norm": 1.0902478694915771,
+      "learning_rate": 2.1881700023523712e-05,
+      "loss": 0.3366,
+      "step": 645
+    },
+    {
+      "epoch": 1.9244444444444444,
+      "grad_norm": 1.0783882141113281,
+      "learning_rate": 2.1750930391862396e-05,
+      "loss": 0.3426,
+      "step": 650
+    },
+    {
+      "epoch": 1.9392592592592592,
+      "grad_norm": 1.2069432735443115,
+      "learning_rate": 2.1619514139142665e-05,
+      "loss": 0.3662,
+      "step": 655
+    },
+    {
+      "epoch": 1.954074074074074,
+      "grad_norm": 1.0932831764221191,
+      "learning_rate": 2.1487463852743067e-05,
+      "loss": 0.3087,
+      "step": 660
+    },
+    {
+      "epoch": 1.968888888888889,
+      "grad_norm": 1.1563724279403687,
+      "learning_rate": 2.1354792180771507e-05,
+      "loss": 0.327,
+      "step": 665
+    },
+    {
+      "epoch": 1.9837037037037037,
+      "grad_norm": 1.4596143960952759,
+      "learning_rate": 2.1221511830853734e-05,
+      "loss": 0.3343,
+      "step": 670
+    },
+    {
+      "epoch": 1.9985185185185186,
+      "grad_norm": 1.1026556491851807,
+      "learning_rate": 2.108763556891621e-05,
+      "loss": 0.344,
+      "step": 675
+    },
+    {
+      "epoch": 2.011851851851852,
+      "grad_norm": 1.4919943809509277,
+      "learning_rate": 2.095317621796336e-05,
+      "loss": 0.2977,
+      "step": 680
+    },
+    {
+      "epoch": 2.026666666666667,
+      "grad_norm": 1.1148377656936646,
+      "learning_rate": 2.08181466568493e-05,
+      "loss": 0.263,
+      "step": 685
+    },
+    {
+      "epoch": 2.0414814814814815,
+      "grad_norm": 1.0882487297058105,
+      "learning_rate": 2.0682559819044348e-05,
+      "loss": 0.2404,
+      "step": 690
+    },
+    {
+      "epoch": 2.0562962962962965,
+      "grad_norm": 1.1826133728027344,
+      "learning_rate": 2.054642869139616e-05,
+      "loss": 0.25,
+      "step": 695
+    },
+    {
+      "epoch": 2.071111111111111,
+      "grad_norm": 1.1646047830581665,
+      "learning_rate": 2.0409766312885845e-05,
+      "loss": 0.2385,
+      "step": 700
+    },
+    {
+      "epoch": 2.0859259259259257,
+      "grad_norm": 1.1066701412200928,
+      "learning_rate": 2.0272585773379047e-05,
+      "loss": 0.2422,
+      "step": 705
+    },
+    {
+      "epoch": 2.100740740740741,
+      "grad_norm": 1.1488316059112549,
+      "learning_rate": 2.0134900212372183e-05,
+      "loss": 0.2411,
+      "step": 710
+    },
+    {
+      "epoch": 2.1155555555555554,
+      "grad_norm": 1.0931707620620728,
+      "learning_rate": 1.999672281773389e-05,
+      "loss": 0.2292,
+      "step": 715
+    },
+    {
+      "epoch": 2.1303703703703705,
+      "grad_norm": 1.1223219633102417,
+      "learning_rate": 1.985806682444186e-05,
+      "loss": 0.2389,
+      "step": 720
+    },
+    {
+      "epoch": 2.145185185185185,
+      "grad_norm": 1.2314571142196655,
+      "learning_rate": 1.9718945513315178e-05,
+      "loss": 0.2688,
+      "step": 725
+    },
+    {
+      "epoch": 2.16,
+      "grad_norm": 1.2058697938919067,
+      "learning_rate": 1.9579372209742218e-05,
+      "loss": 0.2214,
+      "step": 730
+    },
+    {
+      "epoch": 2.1748148148148148,
+      "grad_norm": 1.155304193496704,
+      "learning_rate": 1.9439360282404352e-05,
+      "loss": 0.2588,
+      "step": 735
+    },
+    {
+      "epoch": 2.18962962962963,
+      "grad_norm": 1.2274935245513916,
+      "learning_rate": 1.929892314199542e-05,
+      "loss": 0.2412,
+      "step": 740
+    },
+    {
+      "epoch": 2.2044444444444444,
+      "grad_norm": 0.9616653919219971,
+      "learning_rate": 1.9158074239937235e-05,
+      "loss": 0.2486,
+      "step": 745
+    },
+    {
+      "epoch": 2.2192592592592595,
+      "grad_norm": 1.364080786705017,
+      "learning_rate": 1.9016827067091187e-05,
+      "loss": 0.2025,
+      "step": 750
+    },
+    {
+      "epoch": 2.234074074074074,
+      "grad_norm": 1.1122071743011475,
+      "learning_rate": 1.887519515246604e-05,
+      "loss": 0.2353,
+      "step": 755
+    },
+    {
+      "epoch": 2.2488888888888887,
+      "grad_norm": 1.0747346878051758,
+      "learning_rate": 1.8733192061922073e-05,
+      "loss": 0.2361,
+      "step": 760
+    },
+    {
+      "epoch": 2.2637037037037038,
+      "grad_norm": 1.4977558851242065,
+      "learning_rate": 1.8590831396871744e-05,
+      "loss": 0.2525,
+      "step": 765
+    },
+    {
+      "epoch": 2.2785185185185184,
+      "grad_norm": 1.0749567747116089,
+      "learning_rate": 1.8448126792976902e-05,
+      "loss": 0.232,
+      "step": 770
+    },
+    {
+      "epoch": 2.2933333333333334,
+      "grad_norm": 1.0133754014968872,
+      "learning_rate": 1.8305091918842694e-05,
+      "loss": 0.223,
+      "step": 775
+    },
+    {
+      "epoch": 2.308148148148148,
+      "grad_norm": 1.0828489065170288,
+      "learning_rate": 1.8161740474708406e-05,
+      "loss": 0.2373,
+      "step": 780
+    },
+    {
+      "epoch": 2.322962962962963,
+      "grad_norm": 0.9887102842330933,
+      "learning_rate": 1.8018086191135178e-05,
+      "loss": 0.25,
+      "step": 785
+    },
+    {
+      "epoch": 2.3377777777777777,
+      "grad_norm": 1.0584267377853394,
+      "learning_rate": 1.7874142827690876e-05,
+      "loss": 0.2115,
+      "step": 790
+    },
+    {
+      "epoch": 2.3525925925925923,
+      "grad_norm": 1.1735175848007202,
+      "learning_rate": 1.772992417163217e-05,
+      "loss": 0.2585,
+      "step": 795
+    },
+    {
+      "epoch": 2.3674074074074074,
+      "grad_norm": 1.0981842279434204,
+      "learning_rate": 1.7585444036583932e-05,
+      "loss": 0.1952,
+      "step": 800
+    },
+    {
+      "epoch": 2.3822222222222225,
+      "grad_norm": 1.1940069198608398,
+      "learning_rate": 1.7440716261216153e-05,
+      "loss": 0.2112,
+      "step": 805
+    },
+    {
+      "epoch": 2.397037037037037,
+      "grad_norm": 1.1736091375350952,
+      "learning_rate": 1.729575470791845e-05,
+      "loss": 0.2387,
+      "step": 810
+    },
+    {
+      "epoch": 2.4118518518518517,
+      "grad_norm": 1.0778909921646118,
+      "learning_rate": 1.7150573261472258e-05,
+      "loss": 0.2405,
+      "step": 815
+    },
+    {
+      "epoch": 2.4266666666666667,
+      "grad_norm": 1.1269712448120117,
+      "learning_rate": 1.700518582772094e-05,
+      "loss": 0.2441,
+      "step": 820
+    },
+    {
+      "epoch": 2.4414814814814814,
+      "grad_norm": 1.0724108219146729,
+      "learning_rate": 1.685960633223783e-05,
+      "loss": 0.1992,
+      "step": 825
+    },
+    {
+      "epoch": 2.4562962962962964,
+      "grad_norm": 1.2265195846557617,
+      "learning_rate": 1.6713848718992432e-05,
+      "loss": 0.2364,
+      "step": 830
+    },
+    {
+      "epoch": 2.471111111111111,
+      "grad_norm": 1.1078436374664307,
+      "learning_rate": 1.6567926949014805e-05,
+      "loss": 0.2021,
+      "step": 835
+    },
+    {
+      "epoch": 2.485925925925926,
+      "grad_norm": 1.1599864959716797,
+      "learning_rate": 1.6421854999058353e-05,
+      "loss": 0.2103,
+      "step": 840
+    },
+    {
+      "epoch": 2.5007407407407407,
+      "grad_norm": 1.1391360759735107,
+      "learning_rate": 1.6275646860261098e-05,
+      "loss": 0.1882,
+      "step": 845
+    },
+    {
+      "epoch": 2.5155555555555553,
+      "grad_norm": 1.1846624612808228,
+      "learning_rate": 1.6129316536805574e-05,
+      "loss": 0.2195,
+      "step": 850
+    },
+    {
+      "epoch": 2.5303703703703704,
+      "grad_norm": 1.2152327299118042,
+      "learning_rate": 1.5982878044577466e-05,
+      "loss": 0.2067,
+      "step": 855
+    },
+    {
+      "epoch": 2.5451851851851854,
+      "grad_norm": 1.2093056440353394,
+      "learning_rate": 1.5836345409823125e-05,
+      "loss": 0.1918,
+      "step": 860
+    },
+    {
+      "epoch": 2.56,
+      "grad_norm": 1.2072885036468506,
+      "learning_rate": 1.5689732667806123e-05,
+      "loss": 0.2055,
+      "step": 865
+    },
+    {
+      "epoch": 2.5748148148148147,
+      "grad_norm": 1.0718040466308594,
+      "learning_rate": 1.554305386146291e-05,
+      "loss": 0.194,
+      "step": 870
+    },
+    {
+      "epoch": 2.5896296296296297,
+      "grad_norm": 1.1138250827789307,
+      "learning_rate": 1.5396323040057723e-05,
+      "loss": 0.2061,
+      "step": 875
+    },
+    {
+      "epoch": 2.6044444444444443,
+      "grad_norm": 1.2372530698776245,
+      "learning_rate": 1.5249554257836952e-05,
+      "loss": 0.2055,
+      "step": 880
+    },
+    {
+      "epoch": 2.6192592592592594,
+      "grad_norm": 1.1863386631011963,
+      "learning_rate": 1.5102761572682966e-05,
+      "loss": 0.2252,
+      "step": 885
+    },
+    {
+      "epoch": 2.634074074074074,
+      "grad_norm": 1.1669915914535522,
+      "learning_rate": 1.49559590447676e-05,
+      "loss": 0.1716,
+      "step": 890
+    },
+    {
+      "epoch": 2.648888888888889,
+      "grad_norm": 1.0519976615905762,
+      "learning_rate": 1.4809160735205475e-05,
+      "loss": 0.1935,
+      "step": 895
+    },
+    {
+      "epoch": 2.6637037037037037,
+      "grad_norm": 1.1324058771133423,
+      "learning_rate": 1.466238070470716e-05,
+      "loss": 0.1973,
+      "step": 900
+    },
+    {
+      "epoch": 2.6785185185185183,
+      "grad_norm": 1.1331405639648438,
+      "learning_rate": 1.45156330122324e-05,
+      "loss": 0.1847,
+      "step": 905
+    },
+    {
+      "epoch": 2.6933333333333334,
+      "grad_norm": 1.1123415231704712,
+      "learning_rate": 1.4368931713643537e-05,
+      "loss": 0.1887,
+      "step": 910
+    },
+    {
+      "epoch": 2.7081481481481484,
+      "grad_norm": 1.1478253602981567,
+      "learning_rate": 1.4222290860359187e-05,
+      "loss": 0.1948,
+      "step": 915
+    },
+    {
+      "epoch": 2.722962962962963,
+      "grad_norm": 0.990747332572937,
+      "learning_rate": 1.4075724498008353e-05,
+      "loss": 0.1802,
+      "step": 920
+    },
+    {
+      "epoch": 2.7377777777777776,
+      "grad_norm": 1.0164415836334229,
+      "learning_rate": 1.3929246665085118e-05,
+      "loss": 0.1695,
+      "step": 925
+    },
+    {
+      "epoch": 2.7525925925925927,
+      "grad_norm": 0.9575899839401245,
+      "learning_rate": 1.3782871391603998e-05,
+      "loss": 0.1941,
+      "step": 930
+    },
+    {
+      "epoch": 2.7674074074074073,
+      "grad_norm": 1.0679200887680054,
+      "learning_rate": 1.3636612697756096e-05,
+      "loss": 0.151,
+      "step": 935
+    },
+    {
+      "epoch": 2.7822222222222224,
+      "grad_norm": 1.1639066934585571,
+      "learning_rate": 1.3490484592566235e-05,
+      "loss": 0.1788,
+      "step": 940
+    },
+    {
+      "epoch": 2.797037037037037,
+      "grad_norm": 1.11581552028656,
+      "learning_rate": 1.334450107255113e-05,
+      "loss": 0.2031,
+      "step": 945
+    },
+    {
+      "epoch": 2.811851851851852,
+      "grad_norm": 1.2103270292282104,
+      "learning_rate": 1.3198676120378753e-05,
+      "loss": 0.1923,
+      "step": 950
+    },
+    {
+      "epoch": 2.8266666666666667,
+      "grad_norm": 1.1781346797943115,
+      "learning_rate": 1.305302370352906e-05,
+      "loss": 0.1778,
+      "step": 955
+    },
+    {
+      "epoch": 2.8414814814814813,
+      "grad_norm": 1.1030999422073364,
+      "learning_rate": 1.2907557772956146e-05,
+      "loss": 0.1686,
+      "step": 960
+    },
+    {
+      "epoch": 2.8562962962962963,
+      "grad_norm": 1.0762885808944702,
+      "learning_rate": 1.2762292261751964e-05,
+      "loss": 0.1856,
+      "step": 965
+    },
+    {
+      "epoch": 2.871111111111111,
+      "grad_norm": 1.2288093566894531,
+      "learning_rate": 1.2617241083811808e-05,
+      "loss": 0.205,
+      "step": 970
+    },
+    {
+      "epoch": 2.885925925925926,
+      "grad_norm": 1.2217600345611572,
+      "learning_rate": 1.2472418132501603e-05,
+      "loss": 0.1724,
+      "step": 975
+    },
+    {
+      "epoch": 2.9007407407407406,
+      "grad_norm": 1.0130177736282349,
+      "learning_rate": 1.2327837279327136e-05,
+      "loss": 0.1602,
+      "step": 980
+    },
+    {
+      "epoch": 2.9155555555555557,
+      "grad_norm": 1.0387743711471558,
+      "learning_rate": 1.2183512372605437e-05,
+      "loss": 0.1646,
+      "step": 985
+    },
+    {
+      "epoch": 2.9303703703703703,
+      "grad_norm": 1.0937505960464478,
+      "learning_rate": 1.2039457236138348e-05,
+      "loss": 0.1631,
+      "step": 990
+    },
+    {
+      "epoch": 2.9451851851851854,
+      "grad_norm": 1.168393611907959,
+      "learning_rate": 1.1895685667888422e-05,
+      "loss": 0.1658,
+      "step": 995
+    },
+    {
+      "epoch": 2.96,
+      "grad_norm": 1.0644232034683228,
+      "learning_rate": 1.1752211438657354e-05,
+      "loss": 0.158,
+      "step": 1000
+    },
+    {
+      "epoch": 2.974814814814815,
+      "grad_norm": 1.0772701501846313,
+      "learning_rate": 1.1609048290766953e-05,
+      "loss": 0.1545,
+      "step": 1005
+    },
+    {
+      "epoch": 2.9896296296296296,
+      "grad_norm": 1.0731050968170166,
+      "learning_rate": 1.146620993674287e-05,
+      "loss": 0.1719,
+      "step": 1010
+    }
+  ],
+  "logging_steps": 5,
+  "max_steps": 1690,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 5,
+  "save_steps": 2000,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 1.4370912295641416e+18,
+  "train_batch_size": 2,
+  "trial_name": null,
+  "trial_params": null
+}
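
Note on reading the log above: each entry records `epoch`, `grad_norm`, `learning_rate`, `loss` and `step`, logged every 5 steps. The sketch below is one way to summarize such a log by averaging the loss per epoch. It assumes the entries sit under the `log_history` key that the Hugging Face Trainer normally writes to `trainer_state.json`. The helper names `load_log_history` and `mean_loss_by_epoch` are illustrative only and are not part of this commit.

```python
import json
import math
from collections import defaultdict


def load_log_history(path):
    """Read the list of logged entries from a trainer_state.json file.

    Assumes the standard Trainer layout with a top-level "log_history" list.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["log_history"]


def mean_loss_by_epoch(log_history):
    """Average the logged training loss within each (floored) epoch."""
    buckets = defaultdict(list)
    for entry in log_history:
        # Skip entries without a training loss (e.g. evaluation records).
        if "loss" not in entry or "epoch" not in entry:
            continue
        buckets[math.floor(entry["epoch"])].append(entry["loss"])
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(buckets.items())}


# Four entries copied from the log above (steps 360, 675, 680 and 1010).
sample = [
    {"epoch": 1.0651851851851852, "loss": 0.566, "step": 360},
    {"epoch": 1.9985185185185186, "loss": 0.344, "step": 675},
    {"epoch": 2.011851851851852, "loss": 0.2977, "step": 680},
    {"epoch": 2.9896296296296296, "loss": 0.1719, "step": 1010},
]
summary = mean_loss_by_epoch(sample)
```

Running `mean_loss_by_epoch(load_log_history(path))` on the full file would give one averaged loss per epoch, which is easier to compare across checkpoints than the raw 5-step entries.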

12_128_e5_3e-5/checkpoint-1014/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cf3de353e55c6bf100f84167a13e5a6eca9560d421b12215d80a3ea91d88ef2e
+size 7736
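
The three lines above are a Git LFS pointer, not the binary itself: a spec version, an object id (hash algorithm plus digest) and the size in bytes. As a rough sketch of how such a pointer can be read, the hypothetical helper `parse_lfs_pointer` below splits each line into a key and a value. It tolerates the leading `+` that this diff view adds, and it is not part of the commit or of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = {}
    for raw in text.splitlines():
        # Drop the "+" diff marker (if present) and surrounding whitespace.
        line = raw.lstrip("+").strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "+version https://git-lfs.github.com/spec/v1\n"
    "+oid sha256:cf3de353e55c6bf100f84167a13e5a6eca9560d421b12215d80a3ea91d88ef2e\n"
    "+size 7736\n"
)
```

The `size` field is the size of the real `training_args.bin` object in bytes, so a downloaded file can be checked against it and against the digest.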

12_128_e5_3e-5/checkpoint-1014/vocab.json ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1014/zero_to_fp32.py ADDED

	@@ -0,0 +1,604 @@

+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+# DeepSpeed Team
+# This script extracts consolidated fp32 weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
+# copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+# the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+# application.
+#
+# example: python zero_to_fp32.py . pytorch_model.bin
+import argparse
+import torch
+import glob
+import math
+import os
+import re
+from collections import OrderedDict
+from dataclasses import dataclass
+# while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
+# DeepSpeed data structures it has to be available in the current python environment.
+from deepspeed.utils import logger
+from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                            FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                            FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+@dataclass
+class zero_model_state:
+    buffers: dict
+    param_shapes: dict
+    shared_params: list
+    ds_version: int
+    frozen_param_shapes: dict
+    frozen_param_fragments: dict
+debug = 0
+# load to cpu
+device = torch.device('cpu')
+def atoi(text):
+    return int(text) if text.isdigit() else text
+def natural_keys(text):
+    '''
+    alist.sort(key=natural_keys) sorts in human order
+    http://nedbatchelder.com/blog/200712/human_sorting.html
+    (See Toothy's implementation in the comments)
+    '''
+    return [atoi(c) for c in re.split(r'(\d+)', text)]
+def get_model_state_file(checkpoint_dir, zero_stage):
+    if not os.path.isdir(checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+    # there should be only one file
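+    if zero_stage <= 2:
+        file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+    elif zero_stage == 3:
+        file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+    else:
+        # without this branch, `file` would be unbound for an unexpected stage
+        raise ValueError(f"unknown zero stage {zero_stage}")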
+    if not os.path.exists(file):
+        raise FileNotFoundError(f"can't find model states file at '{file}'")
+    return file
+def get_checkpoint_files(checkpoint_dir, glob_pattern):
+    # XXX: need to test that this simple glob rule works for multi-node setup too
+    ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+    if len(ckpt_files) == 0:
+        raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+    return ckpt_files
+def get_optim_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+def get_model_state_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+def parse_model_states(files):
+    zero_model_states = []
+    for file in files:
+        state_dict = torch.load(file, map_location=device)
+        if BUFFER_NAMES not in state_dict:
+            raise ValueError(f"{file} is not a model state checkpoint")
+        buffer_names = state_dict[BUFFER_NAMES]
+        if debug:
+            print("Found buffers:", buffer_names)
+        # recover just the buffers while restoring them to fp32 if they were saved in fp16
+        buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+        param_shapes = state_dict[PARAM_SHAPES]
+        # collect parameters that are included in param_shapes
+        param_names = []
+        for s in param_shapes:
+            for name in s.keys():
+                param_names.append(name)
+        # update with frozen parameters
+        frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+        if frozen_param_shapes is not None:
+            if debug:
+                print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+            param_names += list(frozen_param_shapes.keys())
+        # handle shared params
+        shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+        ds_version = state_dict.get(DS_VERSION, None)
+        frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+        z_model_state = zero_model_state(buffers=buffers,
+                                         param_shapes=param_shapes,
+                                         shared_params=shared_params,
+                                         ds_version=ds_version,
+                                         frozen_param_shapes=frozen_param_shapes,
+                                         frozen_param_fragments=frozen_param_fragments)
+        zero_model_states.append(z_model_state)
+    return zero_model_states
+def parse_optim_states(files, ds_checkpoint_dir):
+    total_files = len(files)
+    state_dicts = []
+    for f in files:
+        state_dict = torch.load(f, map_location=device)
+        # immediately discard the two potentially huge optimizer states, as we only care about the fp32 master
+        # weights; also handle the case where they were already removed by another helper script
+        state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+        state_dicts.append(state_dict)
+    if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+        raise ValueError(f"{files[0]} is not a zero checkpoint")
+    zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+    world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+    # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
+    # parameters can be different from data parallelism for non-expert parameters. So we can just
+    # use the max of the partition_count to get the dp world_size.
+    if isinstance(world_size, list):
+        world_size = max(world_size)
+    if world_size != total_files:
+        raise ValueError(
+            f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+            "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+        )
+    # the groups are named differently in each stage
+    if zero_stage <= 2:
+        fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+    elif zero_stage == 3:
+        fp32_groups_key = FP32_FLAT_GROUPS
+    else:
+        raise ValueError(f"unknown zero stage {zero_stage}")
+    if zero_stage <= 2:
+        fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+    elif zero_stage == 3:
+        # if there is more than one param group, there will be multiple flattened tensors - one
+        # flattened tensor per group - for simplicity merge them into a single tensor
+        #
+        # XXX: could make the script more memory efficient for when there are multiple groups - it
+        # will require matching the sub-lists of param_shapes for each param group flattened tensor
+        fp32_flat_groups = [
+            torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
+        ]
+    return zero_stage, world_size, fp32_flat_groups
+def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
+    """
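+    Returns fp32 state_dict reconstructed from ds checkpoint
+    Args:
+        - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+        - ``exclude_frozen_parameters``: if true, frozen parameters are left out of the returned state_dict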
+    """
+    print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+    optim_files = get_optim_files(ds_checkpoint_dir)
+    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+    print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+    model_files = get_model_state_files(ds_checkpoint_dir)
+    zero_model_states = parse_model_states(model_files)
+    print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+    if zero_stage <= 2:
+        return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+    elif zero_stage == 3:
+        return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+def _zero2_merge_frozen_params(state_dict, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+    frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+    if debug:
+        num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+        print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        state_dict[name] = frozen_param_fragments[name]
+        if debug:
+            print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+def _has_callable(obj, fn):
+    attr = getattr(obj, fn, None)
+    return callable(attr)
+def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
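+    # Reconstruction protocol: for zero2 each rank holds one contiguous slice of every flattened
+    # param group, so concatenate the per-rank slices of each group in rank order, then carve the
+    # individual params out of the merged vector sequentially, using the saved shapes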
+    if debug:
+        for i in range(world_size):
+            for j in range(len(fp32_flat_groups[0])):
+                print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+    # XXX: memory usage doubles here (zero2)
+    num_param_groups = len(fp32_flat_groups[0])
+    merged_single_partition_of_fp32_groups = []
+    for i in range(num_param_groups):
+        merged_partitions = [sd[i] for sd in fp32_flat_groups]
+        full_single_fp32_vector = torch.cat(merged_partitions, 0)
+        merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+    avail_numel = sum(
+        [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+    if debug:
+        wanted_params = sum([len(shapes) for shapes in param_shapes])
+        wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+        # not asserting if there is a mismatch due to possible padding
+        print(f"Have {avail_numel} numels to process.")
+        print(f"Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    total_numel = 0
+    total_params = 0
+    for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+        offset = 0
+        avail_numel = full_single_fp32_vector.numel()
+        for name, shape in shapes.items():
+            unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
+            total_numel += unpartitioned_numel
+            total_params += 1
+            if debug:
+                print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+            state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+            offset += unpartitioned_numel
+        # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+        # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+        # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+        # live optimizer object, so we are checking that the numbers are within the right range
+        align_to = 2 * world_size
+        def zero2_align(x):
+            return align_to * math.ceil(x / align_to)
+        if debug:
+            print(f"original offset={offset}, avail_numel={avail_numel}")
+        offset = zero2_align(offset)
+        avail_numel = zero2_align(avail_numel)
+        if debug:
+            print(f"aligned  offset={offset}, avail_numel={avail_numel}")
+        # Sanity check
+        if offset != avail_numel:
+            raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero2_merge_frozen_params(state_dict, zero_model_states)
+    _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+    remainder = unpartitioned_numel % world_size
+    padding_numel = (world_size - remainder) if remainder else 0
+    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+    return partitioned_numel, padding_numel
+def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    if debug:
+        for i in range(world_size):
+            num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+            print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in zero_model_states[0].frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+        state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
+    avail_numel = fp32_flat_groups[0].numel() * world_size
+    # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+    # param, re-consolidating each param, while dealing with padding if any
+    # merge list of dicts, preserving order
+    param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+    if debug:
+        for i in range(world_size):
+            print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+        wanted_params = len(param_shapes)
+        wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+        # not asserting if there is a mismatch due to possible padding
+        avail_numel = fp32_flat_groups[0].numel() * world_size
+        print(f"Trainable params: Have {avail_numel} numels to process.")
+        print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    offset = 0
+    total_numel = 0
+    total_params = 0
+    for name, shape in param_shapes.items():
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        total_params += 1
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+        # XXX: memory usage doubles here
+        state_dict[name] = torch.cat(
+            tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+            0).narrow(0, 0, unpartitioned_numel).view(shape)
+        offset += partitioned_numel
+    offset *= world_size
+    # Sanity check
+    if offset != avail_numel:
+        raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+    _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+    ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+    via a model hub.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder
+        - ``tag``: checkpoint tag used as a unique identifier for the checkpoint, e.g., ``global_step14``. If not provided, the tag is read from the ``latest`` file in the checkpoint folder
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    Returns:
+        - pytorch ``state_dict``
+    Note: this approach may not work if your application doesn't have sufficient free CPU memory and
+    you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
+    the checkpoint.
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+        # do the training and checkpoint saving
+        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+        model = model.cpu() # move to cpu
+        model.load_state_dict(state_dict)
+        # submit to model hub or save the model to share with others
+    In this example the ``model`` will no longer be usable in the deepspeed context of the same
+    application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+    """
+    if tag is None:
+        latest_path = os.path.join(checkpoint_dir, 'latest')
+        if os.path.isfile(latest_path):
+            with open(latest_path, 'r') as fd:
+                tag = fd.read().strip()
+        else:
+            raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+    ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+    if not os.path.isdir(ds_checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
+def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+    loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    """
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
+    print(f"Saving fp32 state dict to {output_file}")
+    torch.save(state_dict, output_file)
+def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+    """
+    1. Put the provided model to cpu
+    2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+    3. Load it into the provided model
+    Args:
+        - ``model``: the model object to update
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+    Returns:
+        - ``model``: modified model
+    Make sure you have plenty of CPU memory available before you call this function. If you don't
+    have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+    conveniently placed for you in the checkpoint folder.
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+        model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+        # submit to model hub or save the model to share with others
+    Note that once this has been run, the ``model`` will no longer be usable in the deepspeed context
+    of the same application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    """
+    logger.info("Extracting fp32 weights")
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+    logger.info("Overwriting model with fp32 weights")
+    model = model.cpu()
+    model.load_state_dict(state_dict, strict=False)
+    return model
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("checkpoint_dir",
+                        type=str,
+                        help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+    parser.add_argument(
+        "output_file",
+        type=str,
+        help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
+    parser.add_argument("-t",
+                        "--tag",
+                        type=str,
+                        default=None,
+                        help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+    parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
+    parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+    args = parser.parse_args()
+    debug = args.debug
+    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
+                                               args.output_file,
+                                               tag=args.tag,
+                                               exclude_frozen_parameters=args.exclude_frozen_parameters)
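
Reviewer note on the ZeRO-3 merge in `zero_to_fp32.py` above: the sketch below replays the same offset arithmetic on plain Python lists instead of torch tensors, which may make the padding handling easier to follow. `merge_zero3_partitions` and its `param_numels` argument are hypothetical names introduced for illustration; only `zero3_partitioned_param_info` mirrors a function from the script. It is a simplified model under those assumptions, not a replacement for the script.

```python
import math


def zero3_partitioned_param_info(unpartitioned_numel, world_size):
    # Same arithmetic as the function of the same name in the script.
    remainder = unpartitioned_numel % world_size
    padding_numel = (world_size - remainder) if remainder else 0
    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
    return partitioned_numel, padding_numel


def merge_zero3_partitions(flat_groups, param_numels):
    """Rebuild each param from per-rank flat partitions (hypothetical helper).

    flat_groups: one flat list of values per rank, all of equal length.
    param_numels: ordered mapping of param name -> unpartitioned element count.
    """
    world_size = len(flat_groups)
    offset = 0
    merged = {}
    for name, numel in param_numels.items():
        partitioned_numel, _padding = zero3_partitioned_param_info(numel, world_size)
        # Zip the slices of every rank together at this param's boundary.
        joined = []
        for rank in range(world_size):
            joined.extend(flat_groups[rank][offset:offset + partitioned_numel])
        # Any padding sits at the tail of the joined vector, so truncate it.
        merged[name] = joined[:numel]
        offset += partitioned_numel
    # Sanity check, as in the script: every element must have been consumed.
    avail_numel = len(flat_groups[0]) * world_size
    if offset * world_size != avail_numel:
        raise ValueError(f"consumed {offset * world_size} numels out of {avail_numel}")
    return merged
```

With two ranks, a 3-element param occupies 2 slots per rank and leaves one padding slot on the last rank, which the final slice discards.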

12_128_e5_3e-5/checkpoint-1352/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: ibm-granite/granite-3.3-8b-base
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.15.2

12_128_e5_3e-5/checkpoint-1352/adapter_config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+  "bias": "none",
+  "corda_config": null,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 256,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 128,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "up_proj",
+    "k_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "v_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_rslora": false
+}
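
Reviewer note on the adapter config above: a minimal sketch of how the effective LoRA scaling follows from `lora_alpha`, `r`, and `use_rslora`, assuming the usual PEFT convention (`lora_alpha / r`, or `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled). `ADAPTER_CONFIG_EXCERPT` and `lora_scaling` are hypothetical names; the excerpt copies only the fields needed here.

```python
import json
import math

# Excerpt of the fields from adapter_config.json that affect scaling.
ADAPTER_CONFIG_EXCERPT = """
{
  "lora_alpha": 256,
  "lora_dropout": 0.05,
  "r": 128,
  "use_rslora": false
}
"""


def lora_scaling(config):
    """Return the multiplier applied to the low-rank update (assumed convention)."""
    rank = config["r"]
    alpha = config["lora_alpha"]
    if config.get("use_rslora", False):
        return alpha / math.sqrt(rank)
    return alpha / rank


config = json.loads(ADAPTER_CONFIG_EXCERPT)
print(lora_scaling(config))  # 256 / 128 -> 2.0
```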

12_128_e5_3e-5/checkpoint-1352/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c840347c1c8aa34906666420276a8a0b263dfb53c26256b47ff6529d7b8a4b0c
+size 791751704
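
Reviewer note on the pointer above: the three added lines are a Git LFS pointer, not the weights themselves. A small sketch of reading such a pointer into its fields, assuming the simple `key value` line layout shown in the diff; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c840347c1c8aa34906666420276a8a0b263dfb53c26256b47ff6529d7b8a4b0c
size 791751704
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

The `size` field gives the byte size of the actual `adapter_model.safetensors` file, roughly 792 MB here.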

12_128_e5_3e-5/checkpoint-1352/latest ADDED

	@@ -0,0 +1 @@


+global_step1352
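
Reviewer note on the `latest` file above: `get_fp32_state_dict_from_zero_checkpoint` in `zero_to_fp32.py` falls back to this file when no `tag` is passed. The sketch below isolates that lookup; `resolve_ds_checkpoint_dir` is a hypothetical name, and the logic is a restatement of the tag handling visible in the script rather than a DeepSpeed API.

```python
import os
import tempfile


def resolve_ds_checkpoint_dir(checkpoint_dir, tag=None):
    """Return the tag sub-folder that holds the DeepSpeed shards (hypothetical helper)."""
    if tag is None:
        latest_path = os.path.join(checkpoint_dir, "latest")
        if not os.path.isfile(latest_path):
            raise ValueError(f"Unable to find 'latest' file at {latest_path}")
        with open(latest_path, "r") as fd:
            tag = fd.read().strip()
    ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
    if not os.path.isdir(ds_checkpoint_dir):
        raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
    return ds_checkpoint_dir


if __name__ == "__main__":
    # Build a throwaway layout that mimics checkpoint-1352/ for demonstration.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "global_step1352"))
        with open(os.path.join(root, "latest"), "w") as fd:
            fd.write("global_step1352\n")
        print(resolve_ds_checkpoint_dir(root))
```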

12_128_e5_3e-5/checkpoint-1352/merges.txt ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1352/rng_state_0.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ce20632b3de386e67d98825cea4f7459a9f97aecf7ea04028d6d6e7ac54edbc4
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_1.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:50aeda523c70a34c102359dc20bedf52996acfddbd7afb66c89fa37aac5fa8cb
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_2.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e69b4be8ad95e4b33f0162814d360870d9ecd87f6b5a31588c0d9d2d09ab045
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_3.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:65105a05afbedb47a213c5ea8fc56493a07cf272b07b863f02efde57a87950c1
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_4.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:12646d529bbf43ac33b7e33cc6589a38140af5b75bb53ecaee01305c637c519b
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_5.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:671ca0c759d63b958c9291b92673ce9ead929148d77378269f31d1d8f9e12f21
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_6.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d64052f4cf28b8aabe97c2a5acdd78e6ea98c6dcd376f7d05b1e77b56a99c90f
+size 15920

12_128_e5_3e-5/checkpoint-1352/rng_state_7.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0eaa73ab70c015715928a105bf2137d28526cab0f0ed59d78a79d1e5c8c81489
+size 15920

12_128_e5_3e-5/checkpoint-1352/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0715aa558ed5f468d923d58bb3a05a094e17300a545148922dc743902ab71386
+size 1064

12_128_e5_3e-5/checkpoint-1352/special_tokens_map.json ADDED

	@@ -0,0 +1,45 @@

+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<reponame>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

12_128_e5_3e-5/checkpoint-1352/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1352/tokenizer_config.json ADDED

	@@ -0,0 +1,188 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<fim_prefix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<fim_middle>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<fim_suffix>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<fim_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<commit_before>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<commit_msg>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "17": {
+      "content": "<commit_after>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "18": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<reponame>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
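
Reviewer note on the tokenizer config above: `pad_token` is set to `<reponame>` rather than to the `<|endoftext|>` token shared by BOS, EOS, and UNK. The sketch below looks up the ids of these roles from `added_tokens_decoder`, using only an excerpt of the JSON; `TOKENIZER_CONFIG_EXCERPT` and `special_token_id` are hypothetical names introduced for illustration.

```python
import json

# Excerpt of tokenizer_config.json, limited to the entries used below.
TOKENIZER_CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "0": {"content": "<|endoftext|>", "special": true},
    "18": {"content": "<reponame>", "special": true}
  },
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<reponame>",
  "padding_side": "left"
}
"""


def special_token_id(config, role):
    """Map a role such as 'pad_token' to its id via added_tokens_decoder."""
    content = config[role]
    for token_id, entry in config["added_tokens_decoder"].items():
        if entry["content"] == content:
            return int(token_id)
    return None


tokenizer_config = json.loads(TOKENIZER_CONFIG_EXCERPT)
pad_id = special_token_id(tokenizer_config, "pad_token")
eos_id = special_token_id(tokenizer_config, "eos_token")
print(pad_id, eos_id)
```

Because the pad id differs from the EOS id in this config, padding positions can be told apart from end-of-sequence tokens by id alone.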

12_128_e5_3e-5/checkpoint-1352/trainer_state.json ADDED

	@@ -0,0 +1,1924 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 4.0,
+  "eval_steps": 500,
+  "global_step": 1352,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.014814814814814815,
+      "grad_norm": 1.205528974533081,
+      "learning_rate": 1.411764705882353e-06,
+      "loss": 1.3542,
+      "step": 5
+    },
+    {
+      "epoch": 0.02962962962962963,
+      "grad_norm": 0.9048640131950378,
+      "learning_rate": 3.1764705882352943e-06,
+      "loss": 1.2984,
+      "step": 10
+    },
+    {
+      "epoch": 0.044444444444444446,
+      "grad_norm": 0.7668255567550659,
+      "learning_rate": 4.941176470588235e-06,
+      "loss": 1.3011,
+      "step": 15
+    },
+    {
+      "epoch": 0.05925925925925926,
+      "grad_norm": 0.6333377957344055,
+      "learning_rate": 6.705882352941177e-06,
+      "loss": 1.2844,
+      "step": 20
+    },
+    {
+      "epoch": 0.07407407407407407,
+      "grad_norm": 0.7129969000816345,
+      "learning_rate": 8.470588235294118e-06,
+      "loss": 1.2659,
+      "step": 25
+    },
+    {
+      "epoch": 0.08888888888888889,
+      "grad_norm": 1.9554294347763062,
+      "learning_rate": 1.023529411764706e-05,
+      "loss": 1.267,
+      "step": 30
+    },
+    {
+      "epoch": 0.1037037037037037,
+      "grad_norm": 0.7532616853713989,
+      "learning_rate": 1.2e-05,
+      "loss": 1.1683,
+      "step": 35
+    },
+    {
+      "epoch": 0.11851851851851852,
+      "grad_norm": 0.5177083611488342,
+      "learning_rate": 1.3764705882352941e-05,
+      "loss": 1.2087,
+      "step": 40
+    },
+    {
+      "epoch": 0.13333333333333333,
+      "grad_norm": 0.48525315523147583,
+      "learning_rate": 1.5529411764705886e-05,
+      "loss": 1.2043,
+      "step": 45
+    },
+    {
+      "epoch": 0.14814814814814814,
+      "grad_norm": 0.4408872723579407,
+      "learning_rate": 1.7294117647058823e-05,
+      "loss": 1.1894,
+      "step": 50
+    },
+    {
+      "epoch": 0.16296296296296298,
+      "grad_norm": 0.684490442276001,
+      "learning_rate": 1.9058823529411764e-05,
+      "loss": 1.183,
+      "step": 55
+    },
+    {
+      "epoch": 0.17777777777777778,
+      "grad_norm": 0.4762347936630249,
+      "learning_rate": 2.0823529411764705e-05,
+      "loss": 1.192,
+      "step": 60
+    },
+    {
+      "epoch": 0.1925925925925926,
+      "grad_norm": 0.5843256115913391,
+      "learning_rate": 2.2588235294117646e-05,
+      "loss": 1.1925,
+      "step": 65
+    },
+    {
+      "epoch": 0.2074074074074074,
+      "grad_norm": 0.46570947766304016,
+      "learning_rate": 2.4352941176470587e-05,
+      "loss": 1.1338,
+      "step": 70
+    },
+    {
+      "epoch": 0.2222222222222222,
+      "grad_norm": 0.5429889559745789,
+      "learning_rate": 2.6117647058823532e-05,
+      "loss": 1.1232,
+      "step": 75
+    },
+    {
+      "epoch": 0.23703703703703705,
+      "grad_norm": 0.5680948495864868,
+      "learning_rate": 2.7882352941176473e-05,
+      "loss": 1.1593,
+      "step": 80
+    },
+    {
+      "epoch": 0.2518518518518518,
+      "grad_norm": 0.46284499764442444,
+      "learning_rate": 2.9647058823529414e-05,
+      "loss": 1.0776,
+      "step": 85
+    },
+    {
+      "epoch": 0.26666666666666666,
+      "grad_norm": 0.5861993432044983,
+      "learning_rate": 2.9999540242630432e-05,
+      "loss": 1.1084,
+      "step": 90
+    },
+    {
+      "epoch": 0.2814814814814815,
+      "grad_norm": 0.5396347641944885,
+      "learning_rate": 2.9997672526619356e-05,
+      "loss": 1.1303,
+      "step": 95
+    },
+    {
+      "epoch": 0.2962962962962963,
+      "grad_norm": 0.5758911967277527,
+      "learning_rate": 2.999436829588809e-05,
+      "loss": 1.1098,
+      "step": 100
+    },
+    {
+      "epoch": 0.3111111111111111,
+      "grad_norm": 0.547749936580658,
+      "learning_rate": 2.9989627866924146e-05,
+      "loss": 1.0519,
+      "step": 105
+    },
+    {
+      "epoch": 0.32592592592592595,
+      "grad_norm": 0.5769826769828796,
+      "learning_rate": 2.9983451693777715e-05,
+      "loss": 1.0643,
+      "step": 110
+    },
+    {
+      "epoch": 0.34074074074074073,
+      "grad_norm": 0.6840924620628357,
+      "learning_rate": 2.9975840368018158e-05,
+      "loss": 1.0418,
+      "step": 115
+    },
+    {
+      "epoch": 0.35555555555555557,
+      "grad_norm": 0.874555230140686,
+      "learning_rate": 2.9966794618677357e-05,
+      "loss": 1.0394,
+      "step": 120
+    },
+    {
+      "epoch": 0.37037037037037035,
+      "grad_norm": 0.6433928608894348,
+      "learning_rate": 2.99563153121799e-05,
+      "loss": 1.0176,
+      "step": 125
+    },
+    {
+      "epoch": 0.3851851851851852,
+      "grad_norm": 0.6158267855644226,
+      "learning_rate": 2.9944403452260055e-05,
+      "loss": 0.9896,
+      "step": 130
+    },
+    {
+      "epoch": 0.4,
+      "grad_norm": 0.7418085336685181,
+      "learning_rate": 2.9931060179865677e-05,
+      "loss": 1.0187,
+      "step": 135
+    },
+    {
+      "epoch": 0.4148148148148148,
+      "grad_norm": 0.625586986541748,
+      "learning_rate": 2.991628677304888e-05,
+      "loss": 0.9369,
+      "step": 140
+    },
+    {
+      "epoch": 0.42962962962962964,
+      "grad_norm": 0.7202538847923279,
+      "learning_rate": 2.990008464684366e-05,
+      "loss": 0.9627,
+      "step": 145
+    },
+    {
+      "epoch": 0.4444444444444444,
+      "grad_norm": 0.666614830493927,
+      "learning_rate": 2.9882455353130327e-05,
+      "loss": 0.9299,
+      "step": 150
+    },
+    {
+      "epoch": 0.45925925925925926,
+      "grad_norm": 0.6723878979682922,
+      "learning_rate": 2.9863400580486884e-05,
+      "loss": 0.9275,
+      "step": 155
+    },
+    {
+      "epoch": 0.4740740740740741,
+      "grad_norm": 0.6323536038398743,
+      "learning_rate": 2.984292215402729e-05,
+      "loss": 0.8664,
+      "step": 160
+    },
+    {
+      "epoch": 0.4888888888888889,
+      "grad_norm": 0.837838888168335,
+      "learning_rate": 2.982102203522663e-05,
+      "loss": 0.8963,
+      "step": 165
+    },
+    {
+      "epoch": 0.5037037037037037,
+      "grad_norm": 0.7143777012825012,
+      "learning_rate": 2.9797702321733254e-05,
+      "loss": 0.8955,
+      "step": 170
+    },
+    {
+      "epoch": 0.5185185185185185,
+      "grad_norm": 0.7509482502937317,
+      "learning_rate": 2.9772965247167855e-05,
+      "loss": 0.9471,
+      "step": 175
+    },
+    {
+      "epoch": 0.5333333333333333,
+      "grad_norm": 0.7797493934631348,
+      "learning_rate": 2.974681318090953e-05,
+      "loss": 0.8739,
+      "step": 180
+    },
+    {
+      "epoch": 0.5481481481481482,
+      "grad_norm": 0.845409631729126,
+      "learning_rate": 2.9719248627868823e-05,
+      "loss": 0.8954,
+      "step": 185
+    },
+    {
+      "epoch": 0.562962962962963,
+      "grad_norm": 0.8383833765983582,
+      "learning_rate": 2.9690274228247825e-05,
+      "loss": 0.919,
+      "step": 190
+    },
+    {
+      "epoch": 0.5777777777777777,
+      "grad_norm": 0.8013415932655334,
+      "learning_rate": 2.9659892757287247e-05,
+      "loss": 0.8092,
+      "step": 195
+    },
+    {
+      "epoch": 0.5925925925925926,
+      "grad_norm": 0.7464110255241394,
+      "learning_rate": 2.9628107125000648e-05,
+      "loss": 0.8591,
+      "step": 200
+    },
+    {
+      "epoch": 0.6074074074074074,
+      "grad_norm": 0.9086791276931763,
+      "learning_rate": 2.959492037589567e-05,
+      "loss": 0.8159,
+      "step": 205
+    },
+    {
+      "epoch": 0.6222222222222222,
+      "grad_norm": 0.7836911678314209,
+      "learning_rate": 2.9560335688682443e-05,
+      "loss": 0.8523,
+      "step": 210
+    },
+    {
+      "epoch": 0.6370370370370371,
+      "grad_norm": 0.8023139834403992,
+      "learning_rate": 2.952435637596912e-05,
+      "loss": 0.8181,
+      "step": 215
+    },
+    {
+      "epoch": 0.6518518518518519,
+      "grad_norm": 0.9165554046630859,
+      "learning_rate": 2.9486985883944586e-05,
+      "loss": 0.8079,
+      "step": 220
+    },
+    {
+      "epoch": 0.6666666666666666,
+      "grad_norm": 0.8642603158950806,
+      "learning_rate": 2.944822779204837e-05,
+      "loss": 0.7844,
+      "step": 225
+    },
+    {
+      "epoch": 0.6814814814814815,
+      "grad_norm": 0.8418120741844177,
+      "learning_rate": 2.9408085812627797e-05,
+      "loss": 0.754,
+      "step": 230
+    },
+    {
+      "epoch": 0.6962962962962963,
+      "grad_norm": 0.8577683568000793,
+      "learning_rate": 2.9366563790582416e-05,
+      "loss": 0.8121,
+      "step": 235
+    },
+    {
+      "epoch": 0.7111111111111111,
+      "grad_norm": 0.8663591146469116,
+      "learning_rate": 2.932366570299573e-05,
+      "loss": 0.7656,
+      "step": 240
+    },
+    {
+      "epoch": 0.725925925925926,
+      "grad_norm": 0.8809316158294678,
+      "learning_rate": 2.927939565875424e-05,
+      "loss": 0.7573,
+      "step": 245
+    },
+    {
+      "epoch": 0.7407407407407407,
+      "grad_norm": 0.9155621528625488,
+      "learning_rate": 2.9233757898153907e-05,
+      "loss": 0.7946,
+      "step": 250
+    },
+    {
+      "epoch": 0.7555555555555555,
+      "grad_norm": 0.9761556386947632,
+      "learning_rate": 2.9186756792493996e-05,
+      "loss": 0.7504,
+      "step": 255
+    },
+    {
+      "epoch": 0.7703703703703704,
+      "grad_norm": 0.8892177939414978,
+      "learning_rate": 2.9138396843658383e-05,
+      "loss": 0.7275,
+      "step": 260
+    },
+    {
+      "epoch": 0.7851851851851852,
+      "grad_norm": 1.0526090860366821,
+      "learning_rate": 2.9088682683684363e-05,
+      "loss": 0.7361,
+      "step": 265
+    },
+    {
+      "epoch": 0.8,
+      "grad_norm": 0.9149707555770874,
+      "learning_rate": 2.9037619074318955e-05,
+      "loss": 0.6894,
+      "step": 270
+    },
+    {
+      "epoch": 0.8148148148148148,
+      "grad_norm": 0.9914748668670654,
+      "learning_rate": 2.8985210906562845e-05,
+      "loss": 0.6885,
+      "step": 275
+    },
+    {
+      "epoch": 0.8296296296296296,
+      "grad_norm": 1.0852508544921875,
+      "learning_rate": 2.8931463200201893e-05,
+      "loss": 0.7472,
+      "step": 280
+    },
+    {
+      "epoch": 0.8444444444444444,
+      "grad_norm": 0.8442374467849731,
+      "learning_rate": 2.8876381103326315e-05,
+      "loss": 0.7197,
+      "step": 285
+    },
+    {
+      "epoch": 0.8592592592592593,
+      "grad_norm": 0.9241513609886169,
+      "learning_rate": 2.881996989183762e-05,
+      "loss": 0.6262,
+      "step": 290
+    },
+    {
+      "epoch": 0.8740740740740741,
+      "grad_norm": 0.9735771417617798,
+      "learning_rate": 2.8762234968943242e-05,
+      "loss": 0.6872,
+      "step": 295
+    },
+    {
+      "epoch": 0.8888888888888888,
+      "grad_norm": 1.1165865659713745,
+      "learning_rate": 2.8703181864639013e-05,
+      "loss": 0.6681,
+      "step": 300
+    },
+    {
+      "epoch": 0.9037037037037037,
+      "grad_norm": 1.1977198123931885,
+      "learning_rate": 2.8642816235179497e-05,
+      "loss": 0.7009,
+      "step": 305
+    },
+    {
+      "epoch": 0.9185185185185185,
+      "grad_norm": 1.0150716304779053,
+      "learning_rate": 2.8581143862536195e-05,
+      "loss": 0.6847,
+      "step": 310
+    },
+    {
+      "epoch": 0.9333333333333333,
+      "grad_norm": 0.9897181987762451,
+      "learning_rate": 2.8518170653843775e-05,
+      "loss": 0.6415,
+      "step": 315
+    },
+    {
+      "epoch": 0.9481481481481482,
+      "grad_norm": 0.9338468313217163,
+      "learning_rate": 2.8453902640834232e-05,
+      "loss": 0.6915,
+      "step": 320
+    },
+    {
+      "epoch": 0.9629629629629629,
+      "grad_norm": 0.9977596402168274,
+      "learning_rate": 2.8388345979259168e-05,
+      "loss": 0.6448,
+      "step": 325
+    },
+    {
+      "epoch": 0.9777777777777777,
+      "grad_norm": 1.015538215637207,
+      "learning_rate": 2.8321506948300177e-05,
+      "loss": 0.6219,
+      "step": 330
+    },
+    {
+      "epoch": 0.9925925925925926,
+      "grad_norm": 1.038949966430664,
+      "learning_rate": 2.825339194996743e-05,
+      "loss": 0.631,
+      "step": 335
+    },
+    {
+      "epoch": 1.005925925925926,
+      "grad_norm": 0.8998421430587769,
+      "learning_rate": 2.8184007508486434e-05,
+      "loss": 0.5823,
+      "step": 340
+    },
+    {
+      "epoch": 1.0207407407407407,
+      "grad_norm": 1.0664684772491455,
+      "learning_rate": 2.8113360269673154e-05,
+      "loss": 0.5729,
+      "step": 345
+    },
+    {
+      "epoch": 1.0355555555555556,
+      "grad_norm": 1.047934889793396,
+      "learning_rate": 2.8041457000297456e-05,
+      "loss": 0.5202,
+      "step": 350
+    },
+    {
+      "epoch": 1.0503703703703704,
+      "grad_norm": 0.9848654866218567,
+      "learning_rate": 2.7968304587434973e-05,
+      "loss": 0.5329,
+      "step": 355
+    },
+    {
+      "epoch": 1.0651851851851852,
+      "grad_norm": 1.0740442276000977,
+      "learning_rate": 2.7893910037807415e-05,
+      "loss": 0.566,
+      "step": 360
+    },
+    {
+      "epoch": 1.08,
+      "grad_norm": 1.0978713035583496,
+      "learning_rate": 2.781828047711149e-05,
+      "loss": 0.5689,
+      "step": 365
+    },
+    {
+      "epoch": 1.094814814814815,
+      "grad_norm": 1.1654003858566284,
+      "learning_rate": 2.774142314933636e-05,
+      "loss": 0.543,
+      "step": 370
+    },
+    {
+      "epoch": 1.1096296296296297,
+      "grad_norm": 1.1164604425430298,
+      "learning_rate": 2.76633454160698e-05,
+      "loss": 0.4947,
+      "step": 375
+    },
+    {
+      "epoch": 1.1244444444444444,
+      "grad_norm": 1.0386937856674194,
+      "learning_rate": 2.758405475579308e-05,
+      "loss": 0.4964,
+      "step": 380
+    },
+    {
+      "epoch": 1.1392592592592592,
+      "grad_norm": 1.074188232421875,
+      "learning_rate": 2.750355876316467e-05,
+      "loss": 0.5535,
+      "step": 385
+    },
+    {
+      "epoch": 1.154074074074074,
+      "grad_norm": 1.0344929695129395,
+      "learning_rate": 2.7421865148292796e-05,
+      "loss": 0.5269,
+      "step": 390
+    },
+    {
+      "epoch": 1.1688888888888889,
+      "grad_norm": 0.9018165469169617,
+      "learning_rate": 2.733898173599695e-05,
+      "loss": 0.5389,
+      "step": 395
+    },
+    {
+      "epoch": 1.1837037037037037,
+      "grad_norm": 1.1409497261047363,
+      "learning_rate": 2.7254916465058408e-05,
+      "loss": 0.4876,
+      "step": 400
+    },
+    {
+      "epoch": 1.1985185185185185,
+      "grad_norm": 1.1169958114624023,
+      "learning_rate": 2.7169677387459835e-05,
+      "loss": 0.4854,
+      "step": 405
+    },
+    {
+      "epoch": 1.2133333333333334,
+      "grad_norm": 0.9902099967002869,
+      "learning_rate": 2.7083272667614034e-05,
+      "loss": 0.4844,
+      "step": 410
+    },
+    {
+      "epoch": 1.2281481481481482,
+      "grad_norm": 1.0552476644515991,
+      "learning_rate": 2.699571058158196e-05,
+      "loss": 0.516,
+      "step": 415
+    },
+    {
+      "epoch": 1.242962962962963,
+      "grad_norm": 1.1118648052215576,
+      "learning_rate": 2.6906999516280004e-05,
+      "loss": 0.4889,
+      "step": 420
+    },
+    {
+      "epoch": 1.2577777777777777,
+      "grad_norm": 1.1397099494934082,
+      "learning_rate": 2.681714796867667e-05,
+      "loss": 0.49,
+      "step": 425
+    },
+    {
+      "epoch": 1.2725925925925927,
+      "grad_norm": 1.164249300956726,
+      "learning_rate": 2.672616454497873e-05,
+      "loss": 0.4699,
+      "step": 430
+    },
+    {
+      "epoch": 1.2874074074074073,
+      "grad_norm": 1.187857747077942,
+      "learning_rate": 2.6634057959806872e-05,
+      "loss": 0.4833,
+      "step": 435
+    },
+    {
+      "epoch": 1.3022222222222222,
+      "grad_norm": 1.3279324769973755,
+      "learning_rate": 2.6540837035361033e-05,
+      "loss": 0.4913,
+      "step": 440
+    },
+    {
+      "epoch": 1.317037037037037,
+      "grad_norm": 1.1399012804031372,
+      "learning_rate": 2.6446510700575342e-05,
+      "loss": 0.4803,
+      "step": 445
+    },
+    {
+      "epoch": 1.3318518518518518,
+      "grad_norm": 1.2169464826583862,
+      "learning_rate": 2.6351087990262912e-05,
+      "loss": 0.4724,
+      "step": 450
+    },
+    {
+      "epoch": 1.3466666666666667,
+      "grad_norm": 1.213593602180481,
+      "learning_rate": 2.625457804425046e-05,
+      "loss": 0.4559,
+      "step": 455
+    },
+    {
+      "epoch": 1.3614814814814815,
+      "grad_norm": 1.121523141860962,
+      "learning_rate": 2.6156990106502863e-05,
+      "loss": 0.4625,
+      "step": 460
+    },
+    {
+      "epoch": 1.3762962962962964,
+      "grad_norm": 1.389193058013916,
+      "learning_rate": 2.6058333524237755e-05,
+      "loss": 0.4249,
+      "step": 465
+    },
+    {
+      "epoch": 1.3911111111111112,
+      "grad_norm": 1.2195489406585693,
+      "learning_rate": 2.595861774703022e-05,
+      "loss": 0.4754,
+      "step": 470
+    },
+    {
+      "epoch": 1.405925925925926,
+      "grad_norm": 1.196704626083374,
+      "learning_rate": 2.58578523259077e-05,
+      "loss": 0.4572,
+      "step": 475
+    },
+    {
+      "epoch": 1.4207407407407406,
+      "grad_norm": 1.095668911933899,
+      "learning_rate": 2.5756046912435158e-05,
+      "loss": 0.4805,
+      "step": 480
+    },
+    {
+      "epoch": 1.4355555555555555,
+      "grad_norm": 1.2372747659683228,
+      "learning_rate": 2.5653211257790636e-05,
+      "loss": 0.4113,
+      "step": 485
+    },
+    {
+      "epoch": 1.4503703703703703,
+      "grad_norm": 1.2018955945968628,
+      "learning_rate": 2.5549355211831265e-05,
+      "loss": 0.5064,
+      "step": 490
+    },
+    {
+      "epoch": 1.4651851851851851,
+      "grad_norm": 1.1164575815200806,
+      "learning_rate": 2.5444488722149812e-05,
+      "loss": 0.4418,
+      "step": 495
+    },
+    {
+      "epoch": 1.48,
+      "grad_norm": 1.0810885429382324,
+      "learning_rate": 2.533862183312189e-05,
+      "loss": 0.4304,
+      "step": 500
+    },
+    {
+      "epoch": 1.4948148148148148,
+      "grad_norm": 1.0699983835220337,
+      "learning_rate": 2.5231764684943865e-05,
+      "loss": 0.395,
+      "step": 505
+    },
+    {
+      "epoch": 1.5096296296296297,
+      "grad_norm": 1.0887062549591064,
+      "learning_rate": 2.5123927512661605e-05,
+      "loss": 0.4078,
+      "step": 510
+    },
+    {
+      "epoch": 1.5244444444444445,
+      "grad_norm": 1.0435563325881958,
+      "learning_rate": 2.5015120645190158e-05,
+      "loss": 0.4214,
+      "step": 515
+    },
+    {
+      "epoch": 1.5392592592592593,
+      "grad_norm": 0.9908123016357422,
+      "learning_rate": 2.4905354504324404e-05,
+      "loss": 0.4122,
+      "step": 520
+    },
+    {
+      "epoch": 1.554074074074074,
+      "grad_norm": 1.1591845750808716,
+      "learning_rate": 2.4794639603740844e-05,
+      "loss": 0.3957,
+      "step": 525
+    },
+    {
+      "epoch": 1.568888888888889,
+      "grad_norm": 1.1682777404785156,
+      "learning_rate": 2.4682986547990553e-05,
+      "loss": 0.4238,
+      "step": 530
+    },
+    {
+      "epoch": 1.5837037037037036,
+      "grad_norm": 1.0811845064163208,
+      "learning_rate": 2.4570406031483474e-05,
+      "loss": 0.408,
+      "step": 535
+    },
+    {
+      "epoch": 1.5985185185185187,
+      "grad_norm": 0.9767659306526184,
+      "learning_rate": 2.445690883746407e-05,
+      "loss": 0.3869,
+      "step": 540
+    },
+    {
+      "epoch": 1.6133333333333333,
+      "grad_norm": 1.1874679327011108,
+      "learning_rate": 2.4342505836978463e-05,
+      "loss": 0.4176,
+      "step": 545
+    },
+    {
+      "epoch": 1.6281481481481481,
+      "grad_norm": 1.176788330078125,
+      "learning_rate": 2.422720798783321e-05,
+      "loss": 0.3843,
+      "step": 550
+    },
+    {
+      "epoch": 1.642962962962963,
+      "grad_norm": 1.0802936553955078,
+      "learning_rate": 2.411102633354571e-05,
+      "loss": 0.4016,
+      "step": 555
+    },
+    {
+      "epoch": 1.6577777777777778,
+      "grad_norm": 1.449876308441162,
+      "learning_rate": 2.3993972002286434e-05,
+      "loss": 0.4329,
+      "step": 560
+    },
+    {
+      "epoch": 1.6725925925925926,
+      "grad_norm": 1.415281891822815,
+      "learning_rate": 2.387605620581305e-05,
+      "loss": 0.4036,
+      "step": 565
+    },
+    {
+      "epoch": 1.6874074074074072,
+      "grad_norm": 1.1891310214996338,
+      "learning_rate": 2.3757290238396528e-05,
+      "loss": 0.4104,
+      "step": 570
+    },
+    {
+      "epoch": 1.7022222222222223,
+      "grad_norm": 1.1044912338256836,
+      "learning_rate": 2.3637685475739332e-05,
+      "loss": 0.4061,
+      "step": 575
+    },
+    {
+      "epoch": 1.717037037037037,
+      "grad_norm": 1.0706647634506226,
+      "learning_rate": 2.351725337388586e-05,
+      "loss": 0.3315,
+      "step": 580
+    },
+    {
+      "epoch": 1.731851851851852,
+      "grad_norm": 1.2027537822723389,
+      "learning_rate": 2.3396005468125116e-05,
+      "loss": 0.3624,
+      "step": 585
+    },
+    {
+      "epoch": 1.7466666666666666,
+      "grad_norm": 1.213278889656067,
+      "learning_rate": 2.327395337188585e-05,
+      "loss": 0.3812,
+      "step": 590
+    },
+    {
+      "epoch": 1.7614814814814816,
+      "grad_norm": 1.028390645980835,
+      "learning_rate": 2.3151108775624222e-05,
+      "loss": 0.3587,
+      "step": 595
+    },
+    {
+      "epoch": 1.7762962962962963,
+      "grad_norm": 1.1282511949539185,
+      "learning_rate": 2.3027483445704e-05,
+      "loss": 0.3558,
+      "step": 600
+    },
+    {
+      "epoch": 1.791111111111111,
+      "grad_norm": 1.2234516143798828,
+      "learning_rate": 2.2903089223269595e-05,
+      "loss": 0.3796,
+      "step": 605
+    },
+    {
+      "epoch": 1.805925925925926,
+      "grad_norm": 1.3621071577072144,
+      "learning_rate": 2.277793802311188e-05,
+      "loss": 0.3756,
+      "step": 610
+    },
+    {
+      "epoch": 1.8207407407407408,
+      "grad_norm": 1.1892695426940918,
+      "learning_rate": 2.265204183252694e-05,
+      "loss": 0.3773,
+      "step": 615
+    },
+    {
+      "epoch": 1.8355555555555556,
+      "grad_norm": 0.9955325126647949,
+      "learning_rate": 2.2525412710167933e-05,
+      "loss": 0.3434,
+      "step": 620
+    },
+    {
+      "epoch": 1.8503703703703702,
+      "grad_norm": 1.1191469430923462,
+      "learning_rate": 2.239806278489003e-05,
+      "loss": 0.3555,
+      "step": 625
+    },
+    {
+      "epoch": 1.8651851851851853,
+      "grad_norm": 1.1457566022872925,
+      "learning_rate": 2.2270004254588752e-05,
+      "loss": 0.3586,
+      "step": 630
+    },
+    {
+      "epoch": 1.88,
+      "grad_norm": 1.285786747932434,
+      "learning_rate": 2.2141249385031564e-05,
+      "loss": 0.3506,
+      "step": 635
+    },
+    {
+      "epoch": 1.894814814814815,
+      "grad_norm": 1.1033052206039429,
+      "learning_rate": 2.2011810508683057e-05,
+      "loss": 0.3766,
+      "step": 640
+    },
+    {
+      "epoch": 1.9096296296296296,
+      "grad_norm": 1.0902478694915771,
+      "learning_rate": 2.1881700023523712e-05,
+      "loss": 0.3366,
+      "step": 645
+    },
+    {
+      "epoch": 1.9244444444444444,
+      "grad_norm": 1.0783882141113281,
+      "learning_rate": 2.1750930391862396e-05,
+      "loss": 0.3426,
+      "step": 650
+    },
+    {
+      "epoch": 1.9392592592592592,
+      "grad_norm": 1.2069432735443115,
+      "learning_rate": 2.1619514139142665e-05,
+      "loss": 0.3662,
+      "step": 655
+    },
+    {
+      "epoch": 1.954074074074074,
+      "grad_norm": 1.0932831764221191,
+      "learning_rate": 2.1487463852743067e-05,
+      "loss": 0.3087,
+      "step": 660
+    },
+    {
+      "epoch": 1.968888888888889,
+      "grad_norm": 1.1563724279403687,
+      "learning_rate": 2.1354792180771507e-05,
+      "loss": 0.327,
+      "step": 665
+    },
+    {
+      "epoch": 1.9837037037037037,
+      "grad_norm": 1.4596143960952759,
+      "learning_rate": 2.1221511830853734e-05,
+      "loss": 0.3343,
+      "step": 670
+    },
+    {
+      "epoch": 1.9985185185185186,
+      "grad_norm": 1.1026556491851807,
+      "learning_rate": 2.108763556891621e-05,
+      "loss": 0.344,
+      "step": 675
+    },
+    {
+      "epoch": 2.011851851851852,
+      "grad_norm": 1.4919943809509277,
+      "learning_rate": 2.095317621796336e-05,
+      "loss": 0.2977,
+      "step": 680
+    },
+    {
+      "epoch": 2.026666666666667,
+      "grad_norm": 1.1148377656936646,
+      "learning_rate": 2.08181466568493e-05,
+      "loss": 0.263,
+      "step": 685
+    },
+    {
+      "epoch": 2.0414814814814815,
+      "grad_norm": 1.0882487297058105,
+      "learning_rate": 2.0682559819044348e-05,
+      "loss": 0.2404,
+      "step": 690
+    },
+    {
+      "epoch": 2.0562962962962965,
+      "grad_norm": 1.1826133728027344,
+      "learning_rate": 2.054642869139616e-05,
+      "loss": 0.25,
+      "step": 695
+    },
+    {
+      "epoch": 2.071111111111111,
+      "grad_norm": 1.1646047830581665,
+      "learning_rate": 2.0409766312885845e-05,
+      "loss": 0.2385,
+      "step": 700
+    },
+    {
+      "epoch": 2.0859259259259257,
+      "grad_norm": 1.1066701412200928,
+      "learning_rate": 2.0272585773379047e-05,
+      "loss": 0.2422,
+      "step": 705
+    },
+    {
+      "epoch": 2.100740740740741,
+      "grad_norm": 1.1488316059112549,
+      "learning_rate": 2.0134900212372183e-05,
+      "loss": 0.2411,
+      "step": 710
+    },
+    {
+      "epoch": 2.1155555555555554,
+      "grad_norm": 1.0931707620620728,
+      "learning_rate": 1.999672281773389e-05,
+      "loss": 0.2292,
+      "step": 715
+    },
+    {
+      "epoch": 2.1303703703703705,
+      "grad_norm": 1.1223219633102417,
+      "learning_rate": 1.985806682444186e-05,
+      "loss": 0.2389,
+      "step": 720
+    },
+    {
+      "epoch": 2.145185185185185,
+      "grad_norm": 1.2314571142196655,
+      "learning_rate": 1.9718945513315178e-05,
+      "loss": 0.2688,
+      "step": 725
+    },
+    {
+      "epoch": 2.16,
+      "grad_norm": 1.2058697938919067,
+      "learning_rate": 1.9579372209742218e-05,
+      "loss": 0.2214,
+      "step": 730
+    },
+    {
+      "epoch": 2.1748148148148148,
+      "grad_norm": 1.155304193496704,
+      "learning_rate": 1.9439360282404352e-05,
+      "loss": 0.2588,
+      "step": 735
+    },
+    {
+      "epoch": 2.18962962962963,
+      "grad_norm": 1.2274935245513916,
+      "learning_rate": 1.929892314199542e-05,
+      "loss": 0.2412,
+      "step": 740
+    },
+    {
+      "epoch": 2.2044444444444444,
+      "grad_norm": 0.9616653919219971,
+      "learning_rate": 1.9158074239937235e-05,
+      "loss": 0.2486,
+      "step": 745
+    },
+    {
+      "epoch": 2.2192592592592595,
+      "grad_norm": 1.364080786705017,
+      "learning_rate": 1.9016827067091187e-05,
+      "loss": 0.2025,
+      "step": 750
+    },
+    {
+      "epoch": 2.234074074074074,
+      "grad_norm": 1.1122071743011475,
+      "learning_rate": 1.887519515246604e-05,
+      "loss": 0.2353,
+      "step": 755
+    },
+    {
+      "epoch": 2.2488888888888887,
+      "grad_norm": 1.0747346878051758,
+      "learning_rate": 1.8733192061922073e-05,
+      "loss": 0.2361,
+      "step": 760
+    },
+    {
+      "epoch": 2.2637037037037038,
+      "grad_norm": 1.4977558851242065,
+      "learning_rate": 1.8590831396871744e-05,
+      "loss": 0.2525,
+      "step": 765
+    },
+    {
+      "epoch": 2.2785185185185184,
+      "grad_norm": 1.0749567747116089,
+      "learning_rate": 1.8448126792976902e-05,
+      "loss": 0.232,
+      "step": 770
+    },
+    {
+      "epoch": 2.2933333333333334,
+      "grad_norm": 1.0133754014968872,
+      "learning_rate": 1.8305091918842694e-05,
+      "loss": 0.223,
+      "step": 775
+    },
+    {
+      "epoch": 2.308148148148148,
+      "grad_norm": 1.0828489065170288,
+      "learning_rate": 1.8161740474708406e-05,
+      "loss": 0.2373,
+      "step": 780
+    },
+    {
+      "epoch": 2.322962962962963,
+      "grad_norm": 0.9887102842330933,
+      "learning_rate": 1.8018086191135178e-05,
+      "loss": 0.25,
+      "step": 785
+    },
+    {
+      "epoch": 2.3377777777777777,
+      "grad_norm": 1.0584267377853394,
+      "learning_rate": 1.7874142827690876e-05,
+      "loss": 0.2115,
+      "step": 790
+    },
+    {
+      "epoch": 2.3525925925925923,
+      "grad_norm": 1.1735175848007202,
+      "learning_rate": 1.772992417163217e-05,
+      "loss": 0.2585,
+      "step": 795
+    },
+    {
+      "epoch": 2.3674074074074074,
+      "grad_norm": 1.0981842279434204,
+      "learning_rate": 1.7585444036583932e-05,
+      "loss": 0.1952,
+      "step": 800
+    },
+    {
+      "epoch": 2.3822222222222225,
+      "grad_norm": 1.1940069198608398,
+      "learning_rate": 1.7440716261216153e-05,
+      "loss": 0.2112,
+      "step": 805
+    },
+    {
+      "epoch": 2.397037037037037,
+      "grad_norm": 1.1736091375350952,
+      "learning_rate": 1.729575470791845e-05,
+      "loss": 0.2387,
+      "step": 810
+    },
+    {
+      "epoch": 2.4118518518518517,
+      "grad_norm": 1.0778909921646118,
+      "learning_rate": 1.7150573261472258e-05,
+      "loss": 0.2405,
+      "step": 815
+    },
+    {
+      "epoch": 2.4266666666666667,
+      "grad_norm": 1.1269712448120117,
+      "learning_rate": 1.700518582772094e-05,
+      "loss": 0.2441,
+      "step": 820
+    },
+    {
+      "epoch": 2.4414814814814814,
+      "grad_norm": 1.0724108219146729,
+      "learning_rate": 1.685960633223783e-05,
+      "loss": 0.1992,
+      "step": 825
+    },
+    {
+      "epoch": 2.4562962962962964,
+      "grad_norm": 1.2265195846557617,
+      "learning_rate": 1.6713848718992432e-05,
+      "loss": 0.2364,
+      "step": 830
+    },
+    {
+      "epoch": 2.471111111111111,
+      "grad_norm": 1.1078436374664307,
+      "learning_rate": 1.6567926949014805e-05,
+      "loss": 0.2021,
+      "step": 835
+    },
+    {
+      "epoch": 2.485925925925926,
+      "grad_norm": 1.1599864959716797,
+      "learning_rate": 1.6421854999058353e-05,
+      "loss": 0.2103,
+      "step": 840
+    },
+    {
+      "epoch": 2.5007407407407407,
+      "grad_norm": 1.1391360759735107,
+      "learning_rate": 1.6275646860261098e-05,
+      "loss": 0.1882,
+      "step": 845
+    },
+    {
+      "epoch": 2.5155555555555553,
+      "grad_norm": 1.1846624612808228,
+      "learning_rate": 1.6129316536805574e-05,
+      "loss": 0.2195,
+      "step": 850
+    },
+    {
+      "epoch": 2.5303703703703704,
+      "grad_norm": 1.2152327299118042,
+      "learning_rate": 1.5982878044577466e-05,
+      "loss": 0.2067,
+      "step": 855
+    },
+    {
+      "epoch": 2.5451851851851854,
+      "grad_norm": 1.2093056440353394,
+      "learning_rate": 1.5836345409823125e-05,
+      "loss": 0.1918,
+      "step": 860
+    },
+    {
+      "epoch": 2.56,
+      "grad_norm": 1.2072885036468506,
+      "learning_rate": 1.5689732667806123e-05,
+      "loss": 0.2055,
+      "step": 865
+    },
+    {
+      "epoch": 2.5748148148148147,
+      "grad_norm": 1.0718040466308594,
+      "learning_rate": 1.554305386146291e-05,
+      "loss": 0.194,
+      "step": 870
+    },
+    {
+      "epoch": 2.5896296296296297,
+      "grad_norm": 1.1138250827789307,
+      "learning_rate": 1.5396323040057723e-05,
+      "loss": 0.2061,
+      "step": 875
+    },
+    {
+      "epoch": 2.6044444444444443,
+      "grad_norm": 1.2372530698776245,
+      "learning_rate": 1.5249554257836952e-05,
+      "loss": 0.2055,
+      "step": 880
+    },
+    {
+      "epoch": 2.6192592592592594,
+      "grad_norm": 1.1863386631011963,
+      "learning_rate": 1.5102761572682966e-05,
+      "loss": 0.2252,
+      "step": 885
+    },
+    {
+      "epoch": 2.634074074074074,
+      "grad_norm": 1.1669915914535522,
+      "learning_rate": 1.49559590447676e-05,
+      "loss": 0.1716,
+      "step": 890
+    },
+    {
+      "epoch": 2.648888888888889,
+      "grad_norm": 1.0519976615905762,
+      "learning_rate": 1.4809160735205475e-05,
+      "loss": 0.1935,
+      "step": 895
+    },
+    {
+      "epoch": 2.6637037037037037,
+      "grad_norm": 1.1324058771133423,
+      "learning_rate": 1.466238070470716e-05,
+      "loss": 0.1973,
+      "step": 900
+    },
+    {
+      "epoch": 2.6785185185185183,
+      "grad_norm": 1.1331405639648438,
+      "learning_rate": 1.45156330122324e-05,
+      "loss": 0.1847,
+      "step": 905
+    },
+    {
+      "epoch": 2.6933333333333334,
+      "grad_norm": 1.1123415231704712,
+      "learning_rate": 1.4368931713643537e-05,
+      "loss": 0.1887,
+      "step": 910
+    },
+    {
+      "epoch": 2.7081481481481484,
+      "grad_norm": 1.1478253602981567,
+      "learning_rate": 1.4222290860359187e-05,
+      "loss": 0.1948,
+      "step": 915
+    },
+    {
+      "epoch": 2.722962962962963,
+      "grad_norm": 0.990747332572937,
+      "learning_rate": 1.4075724498008353e-05,
+      "loss": 0.1802,
+      "step": 920
+    },
+    {
+      "epoch": 2.7377777777777776,
+      "grad_norm": 1.0164415836334229,
+      "learning_rate": 1.3929246665085118e-05,
+      "loss": 0.1695,
+      "step": 925
+    },
+    {
+      "epoch": 2.7525925925925927,
+      "grad_norm": 0.9575899839401245,
+      "learning_rate": 1.3782871391603998e-05,
+      "loss": 0.1941,
+      "step": 930
+    },
+    {
+      "epoch": 2.7674074074074073,
+      "grad_norm": 1.0679200887680054,
+      "learning_rate": 1.3636612697756096e-05,
+      "loss": 0.151,
+      "step": 935
+    },
+    {
+      "epoch": 2.7822222222222224,
+      "grad_norm": 1.1639066934585571,
+      "learning_rate": 1.3490484592566235e-05,
+      "loss": 0.1788,
+      "step": 940
+    },
+    {
+      "epoch": 2.797037037037037,
+      "grad_norm": 1.11581552028656,
+      "learning_rate": 1.334450107255113e-05,
+      "loss": 0.2031,
+      "step": 945
+    },
+    {
+      "epoch": 2.811851851851852,
+      "grad_norm": 1.2103270292282104,
+      "learning_rate": 1.3198676120378753e-05,
+      "loss": 0.1923,
+      "step": 950
+    },
+    {
+      "epoch": 2.8266666666666667,
+      "grad_norm": 1.1781346797943115,
+      "learning_rate": 1.305302370352906e-05,
+      "loss": 0.1778,
+      "step": 955
+    },
+    {
+      "epoch": 2.8414814814814813,
+      "grad_norm": 1.1030999422073364,
+      "learning_rate": 1.2907557772956146e-05,
+      "loss": 0.1686,
+      "step": 960
+    },
+    {
+      "epoch": 2.8562962962962963,
+      "grad_norm": 1.0762885808944702,
+      "learning_rate": 1.2762292261751964e-05,
+      "loss": 0.1856,
+      "step": 965
+    },
+    {
+      "epoch": 2.871111111111111,
+      "grad_norm": 1.2288093566894531,
+      "learning_rate": 1.2617241083811808e-05,
+      "loss": 0.205,
+      "step": 970
+    },
+    {
+      "epoch": 2.885925925925926,
+      "grad_norm": 1.2217600345611572,
+      "learning_rate": 1.2472418132501603e-05,
+      "loss": 0.1724,
+      "step": 975
+    },
+    {
+      "epoch": 2.9007407407407406,
+      "grad_norm": 1.0130177736282349,
+      "learning_rate": 1.2327837279327136e-05,
+      "loss": 0.1602,
+      "step": 980
+    },
+    {
+      "epoch": 2.9155555555555557,
+      "grad_norm": 1.0387743711471558,
+      "learning_rate": 1.2183512372605437e-05,
+      "loss": 0.1646,
+      "step": 985
+    },
+    {
+      "epoch": 2.9303703703703703,
+      "grad_norm": 1.0937505960464478,
+      "learning_rate": 1.2039457236138348e-05,
+      "loss": 0.1631,
+      "step": 990
+    },
+    {
+      "epoch": 2.9451851851851854,
+      "grad_norm": 1.168393611907959,
+      "learning_rate": 1.1895685667888422e-05,
+      "loss": 0.1658,
+      "step": 995
+    },
+    {
+      "epoch": 2.96,
+      "grad_norm": 1.0644232034683228,
+      "learning_rate": 1.1752211438657354e-05,
+      "loss": 0.158,
+      "step": 1000
+    },
+    {
+      "epoch": 2.974814814814815,
+      "grad_norm": 1.0772701501846313,
+      "learning_rate": 1.1609048290766953e-05,
+      "loss": 0.1545,
+      "step": 1005
+    },
+    {
+      "epoch": 2.9896296296296296,
+      "grad_norm": 1.0731050968170166,
+      "learning_rate": 1.146620993674287e-05,
+      "loss": 0.1719,
+      "step": 1010
+    },
+    {
+      "epoch": 3.002962962962963,
+      "grad_norm": 0.9637435674667358,
+      "learning_rate": 1.1323710058001198e-05,
+      "loss": 0.1557,
+      "step": 1015
+    },
+    {
+      "epoch": 3.017777777777778,
+      "grad_norm": 1.0189080238342285,
+      "learning_rate": 1.1181562303538013e-05,
+      "loss": 0.1441,
+      "step": 1020
+    },
+    {
+      "epoch": 3.0325925925925925,
+      "grad_norm": 0.9776983857154846,
+      "learning_rate": 1.1039780288622036e-05,
+      "loss": 0.117,
+      "step": 1025
+    },
+    {
+      "epoch": 3.0474074074074076,
+      "grad_norm": 0.9218119978904724,
+      "learning_rate": 1.0898377593490544e-05,
+      "loss": 0.127,
+      "step": 1030
+    },
+    {
+      "epoch": 3.062222222222222,
+      "grad_norm": 1.0749598741531372,
+      "learning_rate": 1.0757367762048613e-05,
+      "loss": 0.1177,
+      "step": 1035
+    },
+    {
+      "epoch": 3.0770370370370372,
+      "grad_norm": 1.0553370714187622,
+      "learning_rate": 1.0616764300571845e-05,
+      "loss": 0.1202,
+      "step": 1040
+    },
+    {
+      "epoch": 3.091851851851852,
+      "grad_norm": 1.0266081094741821,
+      "learning_rate": 1.0476580676412706e-05,
+      "loss": 0.1337,
+      "step": 1045
+    },
+    {
+      "epoch": 3.1066666666666665,
+      "grad_norm": 1.0148708820343018,
+      "learning_rate": 1.0336830316710602e-05,
+      "loss": 0.1216,
+      "step": 1050
+    },
+    {
+      "epoch": 3.1214814814814815,
+      "grad_norm": 1.0220842361450195,
+      "learning_rate": 1.0197526607105759e-05,
+      "loss": 0.123,
+      "step": 1055
+    },
+    {
+      "epoch": 3.136296296296296,
+      "grad_norm": 1.0517088174819946,
+      "learning_rate": 1.0058682890457153e-05,
+      "loss": 0.1252,
+      "step": 1060
+    },
+    {
+      "epoch": 3.151111111111111,
+      "grad_norm": 1.0555676221847534,
+      "learning_rate": 9.920312465564483e-06,
+      "loss": 0.1285,
+      "step": 1065
+    },
+    {
+      "epoch": 3.165925925925926,
+      "grad_norm": 0.8544878363609314,
+      "learning_rate": 9.782428585894356e-06,
+      "loss": 0.1167,
+      "step": 1070
+    },
+    {
+      "epoch": 3.180740740740741,
+      "grad_norm": 0.9497200846672058,
+      "learning_rate": 9.645044458310876e-06,
+      "loss": 0.1083,
+      "step": 1075
+    },
+    {
+      "epoch": 3.1955555555555555,
+      "grad_norm": 1.126343846321106,
+      "learning_rate": 9.508173241810635e-06,
+      "loss": 0.1182,
+      "step": 1080
+    },
+    {
+      "epoch": 3.2103703703703705,
+      "grad_norm": 1.0441088676452637,
+      "learning_rate": 9.371828046262299e-06,
+      "loss": 0.1153,
+      "step": 1085
+    },
+    {
+      "epoch": 3.225185185185185,
+      "grad_norm": 1.0101596117019653,
+      "learning_rate": 9.236021931150939e-06,
+      "loss": 0.1021,
+      "step": 1090
+    },
+    {
+      "epoch": 3.24,
+      "grad_norm": 0.9242001175880432,
+      "learning_rate": 9.100767904327153e-06,
+      "loss": 0.115,
+      "step": 1095
+    },
+    {
+      "epoch": 3.254814814814815,
+      "grad_norm": 1.100683569908142,
+      "learning_rate": 8.966078920761125e-06,
+      "loss": 0.1204,
+      "step": 1100
+    },
+    {
+      "epoch": 3.2696296296296294,
+      "grad_norm": 0.9212645292282104,
+      "learning_rate": 8.831967881301784e-06,
+      "loss": 0.1167,
+      "step": 1105
+    },
+    {
+      "epoch": 3.2844444444444445,
+      "grad_norm": 0.9732018709182739,
+      "learning_rate": 8.698447631441126e-06,
+      "loss": 0.1027,
+      "step": 1110
+    },
+    {
+      "epoch": 3.299259259259259,
+      "grad_norm": 0.9952080845832825,
+      "learning_rate": 8.565530960083822e-06,
+      "loss": 0.1068,
+      "step": 1115
+    },
+    {
+      "epoch": 3.314074074074074,
+      "grad_norm": 0.8855594396591187,
+      "learning_rate": 8.433230598322295e-06,
+      "loss": 0.0942,
+      "step": 1120
+    },
+    {
+      "epoch": 3.328888888888889,
+      "grad_norm": 0.9892774224281311,
+      "learning_rate": 8.301559218217278e-06,
+      "loss": 0.1263,
+      "step": 1125
+    },
+    {
+      "epoch": 3.343703703703704,
+      "grad_norm": 0.8043313026428223,
+      "learning_rate": 8.170529431584073e-06,
+      "loss": 0.1055,
+      "step": 1130
+    },
+    {
+      "epoch": 3.3585185185185185,
+      "grad_norm": 1.0043748617172241,
+      "learning_rate": 8.040153788784529e-06,
+      "loss": 0.0999,
+      "step": 1135
+    },
+    {
+      "epoch": 3.3733333333333335,
+      "grad_norm": 0.9208621978759766,
+      "learning_rate": 7.910444777524973e-06,
+      "loss": 0.1195,
+      "step": 1140
+    },
+    {
+      "epoch": 3.388148148148148,
+      "grad_norm": 1.0462204217910767,
+      "learning_rate": 7.781414821660089e-06,
+      "loss": 0.0943,
+      "step": 1145
+    },
+    {
+      "epoch": 3.402962962962963,
+      "grad_norm": 0.9799985885620117,
+      "learning_rate": 7.653076280002925e-06,
+      "loss": 0.1221,
+      "step": 1150
+    },
+    {
+      "epoch": 3.417777777777778,
+      "grad_norm": 0.8395883440971375,
+      "learning_rate": 7.525441445141139e-06,
+      "loss": 0.1037,
+      "step": 1155
+    },
+    {
+      "epoch": 3.4325925925925924,
+      "grad_norm": 1.0844675302505493,
+      "learning_rate": 7.398522542259602e-06,
+      "loss": 0.1032,
+      "step": 1160
+    },
+    {
+      "epoch": 3.4474074074074075,
+      "grad_norm": 0.9907870888710022,
+      "learning_rate": 7.2723317279693956e-06,
+      "loss": 0.1186,
+      "step": 1165
+    },
+    {
+      "epoch": 3.462222222222222,
+      "grad_norm": 0.8970763683319092,
+      "learning_rate": 7.146881089143471e-06,
+      "loss": 0.1205,
+      "step": 1170
+    },
+    {
+      "epoch": 3.477037037037037,
+      "grad_norm": 0.9910985827445984,
+      "learning_rate": 7.022182641758906e-06,
+      "loss": 0.1068,
+      "step": 1175
+    },
+    {
+      "epoch": 3.4918518518518518,
+      "grad_norm": 0.9317256212234497,
+      "learning_rate": 6.898248329745998e-06,
+      "loss": 0.1175,
+      "step": 1180
+    },
+    {
+      "epoch": 3.506666666666667,
+      "grad_norm": 1.1247984170913696,
+      "learning_rate": 6.775090023844237e-06,
+      "loss": 0.119,
+      "step": 1185
+    },
+    {
+      "epoch": 3.5214814814814814,
+      "grad_norm": 0.9871118068695068,
+      "learning_rate": 6.6527195204653094e-06,
+      "loss": 0.098,
+      "step": 1190
+    },
+    {
+      "epoch": 3.536296296296296,
+      "grad_norm": 0.8196695446968079,
+      "learning_rate": 6.531148540563175e-06,
+      "loss": 0.1094,
+      "step": 1195
+    },
+    {
+      "epoch": 3.551111111111111,
+      "grad_norm": 1.0164344310760498,
+      "learning_rate": 6.410388728511454e-06,
+      "loss": 0.1216,
+      "step": 1200
+    },
+    {
+      "epoch": 3.565925925925926,
+      "grad_norm": 0.9530230164527893,
+      "learning_rate": 6.29045165098806e-06,
+      "loss": 0.0928,
+      "step": 1205
+    },
+    {
+      "epoch": 3.580740740740741,
+      "grad_norm": 0.866951584815979,
+      "learning_rate": 6.171348795867332e-06,
+      "loss": 0.106,
+      "step": 1210
+    },
+    {
+      "epoch": 3.5955555555555554,
+      "grad_norm": 0.9906113743782043,
+      "learning_rate": 6.053091571119695e-06,
+      "loss": 0.0996,
+      "step": 1215
+    },
+    {
+      "epoch": 3.6103703703703705,
+      "grad_norm": 0.8702291250228882,
+      "learning_rate": 5.935691303718977e-06,
+      "loss": 0.1008,
+      "step": 1220
+    },
+    {
+      "epoch": 3.625185185185185,
+      "grad_norm": 1.0035004615783691,
+      "learning_rate": 5.8191592385574636e-06,
+      "loss": 0.1048,
+      "step": 1225
+    },
+    {
+      "epoch": 3.64,
+      "grad_norm": 0.9619163870811462,
+      "learning_rate": 5.703506537368869e-06,
+      "loss": 0.0991,
+      "step": 1230
+    },
+    {
+      "epoch": 3.6548148148148147,
+      "grad_norm": 1.143715262413025,
+      "learning_rate": 5.588744277659211e-06,
+      "loss": 0.092,
+      "step": 1235
+    },
+    {
+      "epoch": 3.66962962962963,
+      "grad_norm": 0.8306826949119568,
+      "learning_rate": 5.474883451645791e-06,
+      "loss": 0.0918,
+      "step": 1240
+    },
+    {
+      "epoch": 3.6844444444444444,
+      "grad_norm": 0.9245715737342834,
+      "learning_rate": 5.3619349652043255e-06,
+      "loss": 0.1004,
+      "step": 1245
+    },
+    {
+      "epoch": 3.699259259259259,
+      "grad_norm": 0.871616005897522,
+      "learning_rate": 5.249909636824361e-06,
+      "loss": 0.0901,
+      "step": 1250
+    },
+    {
+      "epoch": 3.714074074074074,
+      "grad_norm": 1.1003148555755615,
+      "learning_rate": 5.138818196573034e-06,
+      "loss": 0.0974,
+      "step": 1255
+    },
+    {
+      "epoch": 3.728888888888889,
+      "grad_norm": 0.8824198842048645,
+      "learning_rate": 5.028671285067349e-06,
+      "loss": 0.1203,
+      "step": 1260
+    },
+    {
+      "epoch": 3.7437037037037038,
+      "grad_norm": 0.8431242108345032,
+      "learning_rate": 4.919479452454969e-06,
+      "loss": 0.0943,
+      "step": 1265
+    },
+    {
+      "epoch": 3.7585185185185184,
+      "grad_norm": 0.8721339106559753,
+      "learning_rate": 4.8112531574037e-06,
+      "loss": 0.0949,
+      "step": 1270
+    },
+    {
+      "epoch": 3.7733333333333334,
+      "grad_norm": 0.9243516325950623,
+      "learning_rate": 4.704002766099746e-06,
+      "loss": 0.103,
+      "step": 1275
+    },
+    {
+      "epoch": 3.788148148148148,
+      "grad_norm": 0.8689306974411011,
+      "learning_rate": 4.597738551254795e-06,
+      "loss": 0.0922,
+      "step": 1280
+    },
+    {
+      "epoch": 3.802962962962963,
+      "grad_norm": 0.9284753203392029,
+      "learning_rate": 4.492470691122069e-06,
+      "loss": 0.1073,
+      "step": 1285
+    },
+    {
+      "epoch": 3.8177777777777777,
+      "grad_norm": 0.8651754260063171,
+      "learning_rate": 4.388209268521451e-06,
+      "loss": 0.106,
+      "step": 1290
+    },
+    {
+      "epoch": 3.8325925925925928,
+      "grad_norm": 0.9527820348739624,
+      "learning_rate": 4.284964269873704e-06,
+      "loss": 0.0908,
+      "step": 1295
+    },
+    {
+      "epoch": 3.8474074074074074,
+      "grad_norm": 0.8584031462669373,
+      "learning_rate": 4.18274558424395e-06,
+      "loss": 0.0876,
+      "step": 1300
+    },
+    {
+      "epoch": 3.862222222222222,
+      "grad_norm": 0.8430463671684265,
+      "learning_rate": 4.081563002394478e-06,
+      "loss": 0.1064,
+      "step": 1305
+    },
+    {
+      "epoch": 3.877037037037037,
+      "grad_norm": 0.8383373618125916,
+      "learning_rate": 3.981426215846964e-06,
+      "loss": 0.0944,
+      "step": 1310
+    },
+    {
+      "epoch": 3.891851851851852,
+      "grad_norm": 1.1086126565933228,
+      "learning_rate": 3.882344815954164e-06,
+      "loss": 0.0972,
+      "step": 1315
+    },
+    {
+      "epoch": 3.9066666666666667,
+      "grad_norm": 0.7546128034591675,
+      "learning_rate": 3.784328292981268e-06,
+      "loss": 0.0826,
+      "step": 1320
+    },
+    {
+      "epoch": 3.9214814814814813,
+      "grad_norm": 1.136211633682251,
+      "learning_rate": 3.687386035196879e-06,
+      "loss": 0.0947,
+      "step": 1325
+    },
+    {
+      "epoch": 3.9362962962962964,
+      "grad_norm": 0.8514955639839172,
+      "learning_rate": 3.59152732797378e-06,
+      "loss": 0.094,
+      "step": 1330
+    },
+    {
+      "epoch": 3.951111111111111,
+      "grad_norm": 0.8677727580070496,
+      "learning_rate": 3.4967613528995686e-06,
+      "loss": 0.0881,
+      "step": 1335
+    },
+    {
+      "epoch": 3.965925925925926,
+      "grad_norm": 0.9636668562889099,
+      "learning_rate": 3.4030971868972e-06,
+      "loss": 0.0882,
+      "step": 1340
+    },
+    {
+      "epoch": 3.9807407407407407,
+      "grad_norm": 0.9266122579574585,
+      "learning_rate": 3.3105438013556046e-06,
+      "loss": 0.0915,
+      "step": 1345
+    },
+    {
+      "epoch": 3.9955555555555557,
+      "grad_norm": 0.7577658891677856,
+      "learning_rate": 3.219110061270366e-06,
+      "loss": 0.0866,
+      "step": 1350
+    }
+  ],
+  "logging_steps": 5,
+  "max_steps": 1690,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 5,
+  "save_steps": 2000,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 1.9159945491890831e+18,
+  "train_batch_size": 2,
+  "trial_name": null,
+  "trial_params": null
+}
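
Each `log_history` entry in the `trainer_state.json` diff above carries `epoch`, `grad_norm`, `learning_rate`, `loss`, and `step`, alongside top-level fields such as `max_steps`. The following is a minimal sketch, not part of this commit, of how such a file could be summarized. The helper names `summarize_log_history` and `load_trainer_state` are hypothetical, and the sample values are copied from three of the entries shown in the diff.

```python
import json
from pathlib import Path


def summarize_log_history(state: dict) -> dict:
    """Summarize the training-loss entries of a trainer state dictionary."""
    # Keep only entries that report a training loss together with a step.
    entries = [e for e in state.get("log_history", []) if "loss" in e and "step" in e]
    if not entries:
        raise ValueError("no training-loss entries found in log_history")
    best = min(entries, key=lambda e: e["loss"])
    last = entries[-1]
    max_steps = state.get("max_steps")
    return {
        "num_entries": len(entries),
        "last_step": last["step"],
        "last_loss": last["loss"],
        "min_loss": best["loss"],
        "min_loss_step": best["step"],
        # Fraction of the planned optimizer steps reached by the last log entry.
        "progress": last["step"] / max_steps if max_steps else None,
    }


def load_trainer_state(checkpoint_dir: str) -> dict:
    """Read trainer_state.json from a checkpoint directory (the path is an assumption)."""
    return json.loads(Path(checkpoint_dir, "trainer_state.json").read_text())


if __name__ == "__main__":
    # Three log entries copied from the diff; the full file has many more.
    sample = {
        "max_steps": 1690,
        "log_history": [
            {"epoch": 3.9066666666666667, "loss": 0.0826, "step": 1320},
            {"epoch": 3.9807407407407407, "loss": 0.0915, "step": 1345},
            {"epoch": 3.9955555555555557, "loss": 0.0866, "step": 1350},
        ],
    }
    print(summarize_log_history(sample))
```

With a local copy of the checkpoint, `summarize_log_history(load_trainer_state("12_128_e5_3e-5/checkpoint-1352"))` would apply the same summary to the complete log.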

12_128_e5_3e-5/checkpoint-1352/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cf3de353e55c6bf100f84167a13e5a6eca9560d421b12215d80a3ea91d88ef2e
+size 7736
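
The three lines above are a Git LFS pointer: the repository stores this small text stub, while the 7736-byte `training_args.bin` itself lives in LFS storage. As an illustration only (not part of this commit), a pointer of this shape could be parsed as follows; `parse_lfs_pointer` is a hypothetical helper name, and the sketch assumes the simple `key value` line layout seen here.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value
    missing = {"version", "oid", "size"} - fields.keys()
    if missing:
        raise ValueError(f"not an LFS pointer, missing keys: {sorted(missing)}")
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


if __name__ == "__main__":
    # Pointer contents copied from the diff (without the leading "+").
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:cf3de353e55c6bf100f84167a13e5a6eca9560d421b12215d80a3ea91d88ef2e\n"
        "size 7736\n"
    )
    print(parse_lfs_pointer(pointer))
```

The `size` field gives the byte length of the real file, so it can serve as a cheap check after downloading the actual object.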

12_128_e5_3e-5/checkpoint-1352/vocab.json ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1352/zero_to_fp32.py ADDED

	@@ -0,0 +1,604 @@

+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+# DeepSpeed Team
+# This script extracts fp32 consolidated weights from ZeRO stage 1, 2 and 3 DeepSpeed checkpoints. It gets
+# copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+# the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+# application.
+#
+# example: python zero_to_fp32.py . pytorch_model.bin
+import argparse
+import torch
+import glob
+import math
+import os
+import re
+from collections import OrderedDict
+from dataclasses import dataclass
+# while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
+# DeepSpeed data structures it has to be available in the current python environment.
+from deepspeed.utils import logger
+from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                            FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                            FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+@dataclass
+class zero_model_state:
+    buffers: dict
+    param_shapes: dict
+    shared_params: list
+    ds_version: int
+    frozen_param_shapes: dict
+    frozen_param_fragments: dict
+debug = 0
+# load to cpu
+device = torch.device('cpu')
+def atoi(text):
+    return int(text) if text.isdigit() else text
+def natural_keys(text):
+    '''
+    alist.sort(key=natural_keys) sorts in human order
+    http://nedbatchelder.com/blog/200712/human_sorting.html
+    (See Toothy's implementation in the comments)
+    '''
+    return [atoi(c) for c in re.split(r'(\d+)', text)]
+def get_model_state_file(checkpoint_dir, zero_stage):
+    if not os.path.isdir(checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+    # there should be only one file
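+    if zero_stage <= 2:
+        file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+    elif zero_stage == 3:
+        file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+    else:
+        # fail early instead of hitting an unbound `file` below
+        raise ValueError(f"unknown zero stage {zero_stage}")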
+    if not os.path.exists(file):
+        raise FileNotFoundError(f"can't find model states file at '{file}'")
+    return file
+def get_checkpoint_files(checkpoint_dir, glob_pattern):
+    # XXX: need to test that this simple glob rule works for multi-node setup too
+    ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+    if len(ckpt_files) == 0:
+        raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+    return ckpt_files
+def get_optim_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+def get_model_state_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+def parse_model_states(files):
+    zero_model_states = []
+    for file in files:
+        state_dict = torch.load(file, map_location=device)
+        if BUFFER_NAMES not in state_dict:
+            raise ValueError(f"{file} is not a model state checkpoint")
+        buffer_names = state_dict[BUFFER_NAMES]
+        if debug:
+            print("Found buffers:", buffer_names)
+        # recover just the buffers while restoring them to fp32 if they were saved in fp16
+        buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+        param_shapes = state_dict[PARAM_SHAPES]
+        # collect parameters that are included in param_shapes
+        param_names = []
+        for s in param_shapes:
+            for name in s.keys():
+                param_names.append(name)
+        # update with frozen parameters
+        frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+        if frozen_param_shapes is not None:
+            if debug:
+                print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+            param_names += list(frozen_param_shapes.keys())
+        # handle shared params
+        shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+        ds_version = state_dict.get(DS_VERSION, None)
+        frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+        z_model_state = zero_model_state(buffers=buffers,
+                                         param_shapes=param_shapes,
+                                         shared_params=shared_params,
+                                         ds_version=ds_version,
+                                         frozen_param_shapes=frozen_param_shapes,
+                                         frozen_param_fragments=frozen_param_fragments)
+        zero_model_states.append(z_model_state)
+    return zero_model_states
+def parse_optim_states(files, ds_checkpoint_dir):
+    total_files = len(files)
+    state_dicts = []
+    for f in files:
+        state_dict = torch.load(f, map_location=device)
+        # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
+        # and also handle the case where it was already removed by another helper script
+        state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+        state_dicts.append(state_dict)
+    if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+        raise ValueError(f"{files[0]} is not a zero checkpoint")
+    zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+    world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+    # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
+    # parameters can be different from data parallelism for non-expert parameters. So we can just
+    # use the max of the partition_count to get the dp world_size.
+    if type(world_size) is list:
+        world_size = max(world_size)
+    if world_size != total_files:
+        raise ValueError(
+            f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+            "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+        )
+    # the groups are named differently in each stage
+    if zero_stage <= 2:
+        fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+    elif zero_stage == 3:
+        fp32_groups_key = FP32_FLAT_GROUPS
+    else:
+        raise ValueError(f"unknown zero stage {zero_stage}")
+    if zero_stage <= 2:
+        fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+    elif zero_stage == 3:
+        # if there is more than one param group, there will be multiple flattened tensors - one
+        # flattened tensor per group - for simplicity merge them into a single tensor
+        #
+        # XXX: could make the script more memory efficient for when there are multiple groups - it
+        # will require matching the sub-lists of param_shapes for each param group flattened tensor
+        fp32_flat_groups = [
+            torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
+        ]
+    return zero_stage, world_size, fp32_flat_groups
+def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
+    """
+    Returns fp32 state_dict reconstructed from ds checkpoint
+    Args:
+        - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+    """
+    print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+    optim_files = get_optim_files(ds_checkpoint_dir)
+    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+    print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+    model_files = get_model_state_files(ds_checkpoint_dir)
+    zero_model_states = parse_model_states(model_files)
+    print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+    if zero_stage <= 2:
+        return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+    elif zero_stage == 3:
+        return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+def _zero2_merge_frozen_params(state_dict, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+    frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+    if debug:
+        num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+        print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        state_dict[name] = frozen_param_fragments[name]
+        if debug:
+            print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+def _has_callable(obj, fn):
+    attr = getattr(obj, fn, None)
+    return callable(attr)
+def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
+    # Reconstruction protocol:
+    #
+    # XXX: document this
+    if debug:
+        for i in range(world_size):
+            for j in range(len(fp32_flat_groups[0])):
+                print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+    # XXX: memory usage doubles here (zero2)
+    num_param_groups = len(fp32_flat_groups[0])
+    merged_single_partition_of_fp32_groups = []
+    for i in range(num_param_groups):
+        merged_partitions = [sd[i] for sd in fp32_flat_groups]
+        full_single_fp32_vector = torch.cat(merged_partitions, 0)
+        merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+    avail_numel = sum(
+        [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+    if debug:
+        wanted_params = sum([len(shapes) for shapes in param_shapes])
+        wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+        # not asserting if there is a mismatch due to possible padding
+        print(f"Have {avail_numel} numels to process.")
+        print(f"Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    total_numel = 0
+    total_params = 0
+    for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+        offset = 0
+        avail_numel = full_single_fp32_vector.numel()
+        for name, shape in shapes.items():
+            unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
+            total_numel += unpartitioned_numel
+            total_params += 1
+            if debug:
+                print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+            state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+            offset += unpartitioned_numel
+        # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+        # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+        # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+        # live optimizer object, so we are checking that the numbers are within the right range
+        align_to = 2 * world_size
+        def zero2_align(x):
+            return align_to * math.ceil(x / align_to)
+        if debug:
+            print(f"original offset={offset}, avail_numel={avail_numel}")
+        offset = zero2_align(offset)
+        avail_numel = zero2_align(avail_numel)
+        if debug:
+            print(f"aligned  offset={offset}, avail_numel={avail_numel}")
+        # Sanity check
+        if offset != avail_numel:
+            raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero2_merge_frozen_params(state_dict, zero_model_states)
+    _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+    remainder = unpartitioned_numel % world_size
+    padding_numel = (world_size - remainder) if remainder else 0
+    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+    return partitioned_numel, padding_numel
+def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    if debug:
+        for i in range(world_size):
+            num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+            print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in zero_model_states[0].frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+        state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
+    avail_numel = fp32_flat_groups[0].numel() * world_size
+    # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+    # param, re-consolidating each param, while dealing with padding if any
+    # merge list of dicts, preserving order
+    param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+    if debug:
+        for i in range(world_size):
+            print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+        wanted_params = len(param_shapes)
+        wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+        # not asserting if there is a mismatch due to possible padding
+        avail_numel = fp32_flat_groups[0].numel() * world_size
+        print(f"Trainable params: Have {avail_numel} numels to process.")
+        print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    offset = 0
+    total_numel = 0
+    total_params = 0
+    for name, shape in param_shapes.items():
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        total_params += 1
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+        # XXX: memory usage doubles here
+        state_dict[name] = torch.cat(
+            tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+            0).narrow(0, 0, unpartitioned_numel).view(shape)
+        offset += partitioned_numel
+    offset *= world_size
+    # Sanity check
+    if offset != avail_numel:
+        raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+    _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None, exclude_frozen_parameters=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+    ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+    via a model hub.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    Returns:
+        - pytorch ``state_dict``
+    Note: this approach may not work if your application doesn't have sufficient free CPU memory and
+    you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
+    the checkpoint.
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+        # do the training and checkpoint saving
+        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+        model = model.cpu() # move to cpu
+        model.load_state_dict(state_dict)
+        # submit to model hub or save the model to share with others
+    In this example the ``model`` will no longer be usable in the deepspeed context of the same
+    application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+    """
+    if tag is None:
+        latest_path = os.path.join(checkpoint_dir, 'latest')
+        if os.path.isfile(latest_path):
+            with open(latest_path, 'r') as fd:
+                tag = fd.read().strip()
+        else:
+            raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+    ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+    if not os.path.isdir(ds_checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
+def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None, exclude_frozen_parameters=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+    loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    """
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
+    print(f"Saving fp32 state dict to {output_file}")
+    torch.save(state_dict, output_file)
+def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+    """
+    1. Put the provided model to cpu
+    2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+    3. Load it into the provided model
+    Args:
+        - ``model``: the model object to update
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+    Returns:
+        - ``model``: modified model
+    Make sure you have plenty of CPU memory available before you call this function. If you don't
+    have enough, use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+    conveniently placed for you in the checkpoint folder.
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+        model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+        # submit to model hub or save the model to share with others
+    Note that once this has been run, the ``model`` will no longer be usable in the deepspeed context
+    of the same application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    """
+    logger.info(f"Extracting fp32 weights")
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+    logger.info(f"Overwriting model with fp32 weights")
+    model = model.cpu()
+    model.load_state_dict(state_dict, strict=False)
+    return model
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("checkpoint_dir",
+                        type=str,
+                        help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+    parser.add_argument(
+        "output_file",
+        type=str,
+        help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
+    parser.add_argument("-t",
+                        "--tag",
+                        type=str,
+                        default=None,
+                        help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+    parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
+    parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+    args = parser.parse_args()
+    debug = args.debug
+    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
+                                               args.output_file,
+                                               tag=args.tag,
+                                               exclude_frozen_parameters=args.exclude_frozen_parameters)
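The ZeRO-3 merge functions above pad each parameter so it splits evenly across ranks, concatenate the per-rank fragments, and then trim the padding. The following is a minimal pure-Python sketch of that arithmetic, using lists in place of tensors. `partition` and `reconstruct` are hypothetical helper names introduced here for illustration; they are not part of `zero_to_fp32.py`.

```python
import math


def zero3_partitioned_param_info(unpartitioned_numel, world_size):
    """Mirror of the script's helper: per-rank fragment size and total padding."""
    remainder = unpartitioned_numel % world_size
    padding_numel = (world_size - remainder) if remainder else 0
    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
    return partitioned_numel, padding_numel


def partition(flat_param, world_size, pad_value=0.0):
    """Hypothetical helper: split a flat parameter into equal per-rank fragments."""
    partitioned_numel, padding_numel = zero3_partitioned_param_info(len(flat_param), world_size)
    padded = list(flat_param) + [pad_value] * padding_numel
    return [padded[r * partitioned_numel:(r + 1) * partitioned_numel] for r in range(world_size)]


def reconstruct(fragments, unpartitioned_numel):
    """Hypothetical helper: concatenate fragments in rank order, then drop trailing padding."""
    merged = [x for frag in fragments for x in frag]
    return merged[:unpartitioned_numel]


# A 10-element parameter spread over 4 ranks needs 2 padding elements.
param = [float(i) for i in range(10)]
frags = partition(param, world_size=4)
restored = reconstruct(frags, len(param))
```

The list slice in `reconstruct` plays the role of `.narrow(0, 0, unpartitioned_numel)` in the script; the real code additionally reshapes the result with `.view(shape)`.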

12_128_e5_3e-5/checkpoint-1690/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: ibm-granite/granite-3.3-8b-base
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.15.2

12_128_e5_3e-5/checkpoint-1690/adapter_config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "ibm-granite/granite-3.3-8b-base",
+  "bias": "none",
+  "corda_config": null,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 256,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 128,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "up_proj",
+    "k_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "v_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_rslora": false
+}
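The `lora_alpha`, `r`, and `use_rslora` fields above together determine how strongly the low-rank update is scaled. The sketch below assumes the usual LoRA convention (`alpha / r`, or `alpha / sqrt(r)` when rank-stabilized LoRA is enabled). The 4096 x 4096 layer shape is illustrative only and is not read from the base model.

```python
import json
import math

# Subset of the adapter_config.json fields shown in the diff.
config = json.loads('{"lora_alpha": 256, "r": 128, "use_rslora": false, "lora_dropout": 0.05}')


def lora_scaling(cfg):
    """Scaling factor applied to the low-rank update B @ A."""
    if cfg["use_rslora"]:
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    return cfg["lora_alpha"] / cfg["r"]


def lora_param_count(in_features, out_features, r):
    """Parameters added to one linear layer: A is (r, in), B is (out, r)."""
    return r * (in_features + out_features)


scaling = lora_scaling(config)
# Illustrative layer shape (an assumption, not taken from the base model config).
extra = lora_param_count(4096, 4096, config["r"])
```

With these values the update is scaled by 2.0, and a rank of 128 adds roughly one million parameters per square 4096-wide projection under the assumed shape.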

12_128_e5_3e-5/checkpoint-1690/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:75c22dff7c5053064acb63e35d37778dc73cafc617dcfaf31503630665d472c9
+size 791751704
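The three lines above are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object id, and the size in bytes. The following is a small sketch of parsing such a pointer and checking downloaded bytes against it. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this illustration.

```python
import hashlib

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:75c22dff7c5053064acb63e35d37778dc73cafc617dcfaf31503630665d472c9
size 791751704
"""


def parse_lfs_pointer(text):
    """Return the pointer fields, with the oid split into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """Hypothetical check: compare raw bytes against the pointer's size and sha256 digest."""
    return len(data) == pointer["size"] and hashlib.sha256(data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

For a file of this size it would be preferable to hash in chunks rather than hold all bytes in memory; the in-memory version is kept here for brevity.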

12_128_e5_3e-5/checkpoint-1690/latest ADDED

	@@ -0,0 +1 @@


+global_step1690
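The `latest` file holds the tag that `get_fp32_state_dict_from_zero_checkpoint` falls back to when no `tag` argument is given. The sketch below restates that lookup as a standalone function and runs it against a throwaway directory. `resolve_tag_dir` is a hypothetical name, and the temporary layout only mimics the checkpoint folder; it contains no real weights.

```python
import os
import tempfile


def resolve_tag_dir(checkpoint_dir, tag=None):
    """Resolve the DeepSpeed tag folder, reading `latest` when no tag is given."""
    if tag is None:
        latest_path = os.path.join(checkpoint_dir, "latest")
        if not os.path.isfile(latest_path):
            raise ValueError(f"Unable to find 'latest' file at {latest_path}")
        with open(latest_path, "r") as fd:
            tag = fd.read().strip()
    tag_dir = os.path.join(checkpoint_dir, tag)
    if not os.path.isdir(tag_dir):
        raise FileNotFoundError(f"Directory '{tag_dir}' doesn't exist")
    return tag_dir


# Build a temporary layout that mimics checkpoint-1690/.
with tempfile.TemporaryDirectory() as ckpt:
    os.mkdir(os.path.join(ckpt, "global_step1690"))
    with open(os.path.join(ckpt, "latest"), "w") as fd:
        fd.write("global_step1690\n")
    resolved = os.path.basename(resolve_tag_dir(ckpt))
```

Passing `tag` explicitly skips the `latest` lookup, which is useful when converting an older step from the same checkpoint folder.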

12_128_e5_3e-5/checkpoint-1690/merges.txt ADDED

The diff for this file is too large to render. See raw diff

12_128_e5_3e-5/checkpoint-1690/rng_state_0.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:24e1b2598d7cfb3b130c9f21a1691a4652c9a1fde23a7ed5ecfb9ab5340b61a8
+size 15920

12_128_e5_3e-5/checkpoint-1690/rng_state_1.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a352d3c083a65cf7aefc04e59b7330289157c3742fdea9d1fee2ee071e7313cd
+size 15920

12_128_e5_3e-5/checkpoint-1690/rng_state_2.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:229e30e97d7b92969142bea09cf2208d7300655c65a382972254bbf8f33d2a19
+size 15920