Upload GRPO checkpoint

Files changed (13)

README.md +202 -0
adapter_config.json +37 -0
adapter_model.safetensors +3 -0
merges.txt +0 -0
optimizer.pt +3 -0
rng_state.pth +3 -0
scheduler.pt +3 -0
special_tokens_map.json +34 -0
tokenizer.json +0 -0
tokenizer_config.json +156 -0
trainer_state.json +1533 -0
training_args.bin +3 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: HuggingFaceTB/SmolLM-135M-Instruct
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.14.0

adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "HuggingFaceTB/SmolLM-135M-Instruct",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.0,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "q_proj",
+    "k_proj",
+    "down_proj",
+    "gate_proj",
+    "up_proj",
+    "o_proj",
+    "v_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
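
As a rough sanity check on this config, the sketch below counts the LoRA parameters it implies and compares the result with the `adapter_model.safetensors` size recorded in this commit (19593064 bytes). The layer shapes (hidden size 576, MLP size 1536, key/value projection width 192, 30 layers) are assumptions about the SmolLM-135M architecture and are not stated anywhere in this diff; float32 storage is assumed as well.

```python
# Values taken from adapter_config.json in this commit.
R = 16
LORA_ALPHA = 32
TARGET_MODULES = ["q_proj", "k_proj", "down_proj", "gate_proj",
                  "up_proj", "o_proj", "v_proj"]

# ASSUMPTION: SmolLM-135M layer shapes (not part of this diff).
HIDDEN, INTERMEDIATE, KV_WIDTH, NUM_LAYERS = 576, 1536, 192, 30

# (in_features, out_features) of each targeted linear layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) to each wrapped linear layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(*SHAPES[name], R) for name in TARGET_MODULES)
total_params = per_layer * NUM_LAYERS

# With "use_rslora": false, the LoRA update is scaled by lora_alpha / r.
scaling = LORA_ALPHA / R

# ASSUMPTION: weights are stored as float32 (4 bytes each).
weight_bytes = total_params * 4
SAFETENSORS_SIZE = 19593064  # "size" in the LFS pointer of this commit

print(f"scaling          = {scaling}")
print(f"trainable params = {total_params:,}")
print(f"weight bytes     = {weight_bytes:,}")
print(f"unexplained      = {SAFETENSORS_SIZE - weight_bytes:,} bytes")
```

Under these assumptions the weights account for all but roughly 55 KB of the file, which would be plausible for the safetensors header and tensor metadata. A larger gap would suggest that one of the assumed shapes or the assumed dtype is wrong.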

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44ee0627257647596ecd0b8b6522a2c65a44f911b5caf126f36d8d55cb9f329f
+size 19593064
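
The three added lines are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object ID, and the byte size of the real file. A minimal sketch of reading such a pointer and checking downloaded bytes against it follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if `data` has the size and SHA-256 digest the pointer records.

    Only sha256 object IDs are handled in this sketch.
    """
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash: {pointer['algo']}")
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["digest"])


# Pointer text copied from the adapter_model.safetensors entry in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:44ee0627257647596ecd0b8b6522a2c65a44f911b5caf126f36d8d55cb9f329f
size 19593064
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

To verify a real download, read the file in binary mode and pass its bytes to `matches_pointer`. The same check applies to the `optimizer.pt`, `rng_state.pth`, `scheduler.pt`, and `training_args.bin` pointers.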

merges.txt ADDED

The diff for this file is too large to render.

optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:753c38e40080e13b7cb9f8d961fecd849e4283e6caa88ca27096020da10da052
+size 11469821

rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a2fd74a833420e5b1acf5a63e3d2b3beaa8037bc5e953edd734227b301a39018
+size 14645

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:770e85488b2d6f2b41a590e6c2bb599f3a174081e14b8c48d3638681b4a06597
+size 1465

special_tokens_map.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": {
+    "content": "<|im_start|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": "<|im_start|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "extra_special_tokens": {},
+  "model_max_length": 2048,
+  "pad_token": "<|im_end|>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
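
The `chat_template` above is a ChatML-style Jinja template: each message is wrapped in `<|im_start|>{role}\n ... <|im_end|>\n`, and an open assistant turn is appended when `add_generation_prompt` is set. The plain-Python sketch below mirrors that logic to show the string the template would produce; `render_chat` is a hypothetical helper written for illustration, not the tokenizer's own code.

```python
IM_START = "<|im_start|>"  # bos_token in tokenizer_config.json
IM_END = "<|im_end|>"      # eos_token and pad_token in tokenizer_config.json


def render_chat(messages, add_generation_prompt=False):
    """Hand-written equivalent of the Jinja chat_template in this commit."""
    parts = [
        f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

In practice the same string would normally be obtained from the loaded tokenizer with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. The hand-rolled version is only meant to make the template's output easy to inspect.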

trainer_state.json ADDED
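
Two regularities in this file's `log_history` can be checked with a short script. The `learning_rate` values drop by 1.6e-07 per step, which matches a linear decay to zero over the 125 steps in `global_step`, and the logged `loss` is close to 0.04 times the logged `kl`. The peak rate of 2e-05, the absence of warmup, and the 0.04 coefficient are all inferred from the logged numbers; none of them is stated in this diff.

```python
import math

PEAK_LR = 2e-5     # ASSUMPTION: inferred from the logged values
TOTAL_STEPS = 125  # "global_step" in trainer_state.json
KL_COEFF = 0.04    # ASSUMPTION: inferred ratio between "loss" and "kl"


def linear_decay_lr(step):
    """Learning rate at `step`, assuming linear decay to zero and no warmup."""
    return PEAK_LR * (TOTAL_STEPS - step) / TOTAL_STEPS


# (step, learning_rate, kl, loss) copied from log_history entries in this file.
LOGGED = [
    (2, 1.968e-05, 0.0016019352478906512, 0.0001),
    (34, 1.4560000000000001e-05, 0.006291602039709687, 0.0003),
    (57, 1.0880000000000001e-05, 0.027743499726057053, 0.0011),
    (63, 9.920000000000002e-06, 0.031847656704485416, 0.0013),
]

for step, lr, kl, loss in LOGGED:
    lr_ok = math.isclose(linear_decay_lr(step), lr, rel_tol=1e-9)
    # The logged loss is rounded to 4 decimals, so compare at that precision.
    loss_ok = math.isclose(round(KL_COEFF * kl, 4), loss, abs_tol=1e-9)
    print(f"step {step:>3}: lr matches={lr_ok}, loss matches={loss_ok}")
```

If the loss really is dominated by a KL penalty, that would fit group-normalized advantages that average out to zero, but this reading is an interpretation of the logs rather than something the checkpoint records.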

	@@ -0,0 +1,1533 @@

+{
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 1.0,
+  "eval_steps": 500,
+  "global_step": 125,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "completion_length": 83.6171875,
+      "epoch": 0.008,
+      "grad_norm": 0.27611714601516724,
+      "kl": 0.0,
+      "learning_rate": 1.9840000000000003e-05,
+      "loss": -0.0,
+      "reward": -275.6640625,
+      "reward_std": 84.78042984008789,
+      "rewards/reward_len": -275.6640625,
+      "step": 1
+    },
+    {
+      "completion_length": 85.9765625,
+      "epoch": 0.016,
+      "grad_norm": 0.24193570017814636,
+      "kl": 0.0016019352478906512,
+      "learning_rate": 1.968e-05,
+      "loss": 0.0001,
+      "reward": -280.484375,
+      "reward_std": 81.01797485351562,
+      "rewards/reward_len": -280.484375,
+      "step": 2
+    },
+    {
+      "completion_length": 81.7421875,
+      "epoch": 0.024,
+      "grad_norm": 0.2550275921821594,
+      "kl": 0.0016417641891166568,
+      "learning_rate": 1.9520000000000003e-05,
+      "loss": 0.0001,
+      "reward": -268.625,
+      "reward_std": 87.85796737670898,
+      "rewards/reward_len": -268.625,
+      "step": 3
+    },
+    {
+      "completion_length": 85.6640625,
+      "epoch": 0.032,
+      "grad_norm": 0.19168736040592194,
+      "kl": 0.0015058275312185287,
+      "learning_rate": 1.936e-05,
+      "loss": 0.0001,
+      "reward": -277.484375,
+      "reward_std": 79.17444229125977,
+      "rewards/reward_len": -277.484375,
+      "step": 4
+    },
+    {
+      "completion_length": 84.4609375,
+      "epoch": 0.04,
+      "grad_norm": 0.26871994137763977,
+      "kl": 0.00157183624105528,
+      "learning_rate": 1.9200000000000003e-05,
+      "loss": 0.0001,
+      "reward": -278.15625,
+      "reward_std": 80.18792724609375,
+      "rewards/reward_len": -278.15625,
+      "step": 5
+    },
+    {
+      "completion_length": 87.515625,
+      "epoch": 0.048,
+      "grad_norm": 0.205452099442482,
+      "kl": 0.0016553726745769382,
+      "learning_rate": 1.904e-05,
+      "loss": 0.0001,
+      "reward": -292.8125,
+      "reward_std": 69.1296615600586,
+      "rewards/reward_len": -292.8125,
+      "step": 6
+    },
+    {
+      "completion_length": 85.0859375,
+      "epoch": 0.056,
+      "grad_norm": 0.3081033527851105,
+      "kl": 0.0017567658214829862,
+      "learning_rate": 1.8880000000000002e-05,
+      "loss": 0.0001,
+      "reward": -272.09375,
+      "reward_std": 76.55761337280273,
+      "rewards/reward_len": -272.09375,
+      "step": 7
+    },
+    {
+      "completion_length": 86.875,
+      "epoch": 0.064,
+      "grad_norm": 0.18402817845344543,
+      "kl": 0.001668189710471779,
+      "learning_rate": 1.8720000000000004e-05,
+      "loss": 0.0001,
+      "reward": -287.2421875,
+      "reward_std": 69.85560989379883,
+      "rewards/reward_len": -287.2421875,
+      "step": 8
+    },
+    {
+      "completion_length": 85.53125,
+      "epoch": 0.072,
+      "grad_norm": 0.2124641090631485,
+      "kl": 0.0017781511414796114,
+      "learning_rate": 1.8560000000000002e-05,
+      "loss": 0.0001,
+      "reward": -280.890625,
+      "reward_std": 76.06471252441406,
+      "rewards/reward_len": -280.890625,
+      "step": 9
+    },
+    {
+      "completion_length": 88.1328125,
+      "epoch": 0.08,
+      "grad_norm": 0.22793518006801605,
+      "kl": 0.0017411371227353811,
+      "learning_rate": 1.8400000000000003e-05,
+      "loss": 0.0001,
+      "reward": -287.3515625,
+      "reward_std": 66.20086669921875,
+      "rewards/reward_len": -287.3515625,
+      "step": 10
+    },
+    {
+      "completion_length": 86.4609375,
+      "epoch": 0.088,
+      "grad_norm": 0.24100051820278168,
+      "kl": 0.0018225719104520977,
+      "learning_rate": 1.824e-05,
+      "loss": 0.0001,
+      "reward": -285.78125,
+      "reward_std": 68.7593765258789,
+      "rewards/reward_len": -285.78125,
+      "step": 11
+    },
+    {
+      "completion_length": 88.09375,
+      "epoch": 0.096,
+      "grad_norm": 0.1997087448835373,
+      "kl": 0.0018586200894787908,
+      "learning_rate": 1.8080000000000003e-05,
+      "loss": 0.0001,
+      "reward": -295.7734375,
+      "reward_std": 71.75521087646484,
+      "rewards/reward_len": -295.7734375,
+      "step": 12
+    },
+    {
+      "completion_length": 82.984375,
+      "epoch": 0.104,
+      "grad_norm": 0.24254590272903442,
+      "kl": 0.001945681986398995,
+      "learning_rate": 1.792e-05,
+      "loss": 0.0001,
+      "reward": -278.640625,
+      "reward_std": 82.34563064575195,
+      "rewards/reward_len": -278.640625,
+      "step": 13
+    },
+    {
+      "completion_length": 89.375,
+      "epoch": 0.112,
+      "grad_norm": 0.21290838718414307,
+      "kl": 0.0018120321328751743,
+      "learning_rate": 1.7760000000000003e-05,
+      "loss": 0.0001,
+      "reward": -298.953125,
+      "reward_std": 63.773353576660156,
+      "rewards/reward_len": -298.953125,
+      "step": 14
+    },
+    {
+      "completion_length": 89.21875,
+      "epoch": 0.12,
+      "grad_norm": 0.23192700743675232,
+      "kl": 0.0020369079429656267,
+      "learning_rate": 1.76e-05,
+      "loss": 0.0001,
+      "reward": -288.28125,
+      "reward_std": 66.4205322265625,
+      "rewards/reward_len": -288.28125,
+      "step": 15
+    },
+    {
+      "completion_length": 81.71875,
+      "epoch": 0.128,
+      "grad_norm": 0.2899166941642761,
+      "kl": 0.002046755747869611,
+      "learning_rate": 1.7440000000000002e-05,
+      "loss": 0.0001,
+      "reward": -262.53125,
+      "reward_std": 89.55130386352539,
+      "rewards/reward_len": -262.53125,
+      "step": 16
+    },
+    {
+      "completion_length": 86.21875,
+      "epoch": 0.136,
+      "grad_norm": 0.2126913219690323,
+      "kl": 0.0020263161277398467,
+      "learning_rate": 1.728e-05,
+      "loss": 0.0001,
+      "reward": -280.7890625,
+      "reward_std": 68.4266586303711,
+      "rewards/reward_len": -280.7890625,
+      "step": 17
+    },
+    {
+      "completion_length": 84.5,
+      "epoch": 0.144,
+      "grad_norm": 0.26783180236816406,
+      "kl": 0.002109398366883397,
+      "learning_rate": 1.7120000000000002e-05,
+      "loss": 0.0001,
+      "reward": -284.234375,
+      "reward_std": 78.40861129760742,
+      "rewards/reward_len": -284.234375,
+      "step": 18
+    },
+    {
+      "completion_length": 84.8671875,
+      "epoch": 0.152,
+      "grad_norm": 0.2541547417640686,
+      "kl": 0.002448722836561501,
+      "learning_rate": 1.696e-05,
+      "loss": 0.0001,
+      "reward": -278.34375,
+      "reward_std": 74.83174514770508,
+      "rewards/reward_len": -278.34375,
+      "step": 19
+    },
+    {
+      "completion_length": 86.265625,
+      "epoch": 0.16,
+      "grad_norm": 0.2557583749294281,
+      "kl": 0.0023585634771734476,
+      "learning_rate": 1.6800000000000002e-05,
+      "loss": 0.0001,
+      "reward": -277.671875,
+      "reward_std": 72.68540573120117,
+      "rewards/reward_len": -277.671875,
+      "step": 20
+    },
+    {
+      "completion_length": 80.4765625,
+      "epoch": 0.168,
+      "grad_norm": 0.3251549303531647,
+      "kl": 0.0028729301411658525,
+      "learning_rate": 1.664e-05,
+      "loss": 0.0001,
+      "reward": -261.0703125,
+      "reward_std": 83.05058288574219,
+      "rewards/reward_len": -261.0703125,
+      "step": 21
+    },
+    {
+      "completion_length": 84.1875,
+      "epoch": 0.176,
+      "grad_norm": 0.2614556849002838,
+      "kl": 0.00248900568112731,
+      "learning_rate": 1.648e-05,
+      "loss": 0.0001,
+      "reward": -271.296875,
+      "reward_std": 73.66436004638672,
+      "rewards/reward_len": -271.296875,
+      "step": 22
+    },
+    {
+      "completion_length": 84.0,
+      "epoch": 0.184,
+      "grad_norm": 0.24119850993156433,
+      "kl": 0.0027453062357380986,
+      "learning_rate": 1.632e-05,
+      "loss": 0.0001,
+      "reward": -278.0703125,
+      "reward_std": 80.45323944091797,
+      "rewards/reward_len": -278.0703125,
+      "step": 23
+    },
+    {
+      "completion_length": 83.7109375,
+      "epoch": 0.192,
+      "grad_norm": 0.2878279983997345,
+      "kl": 0.0028990991413593292,
+      "learning_rate": 1.616e-05,
+      "loss": 0.0001,
+      "reward": -275.8671875,
+      "reward_std": 78.57223129272461,
+      "rewards/reward_len": -275.8671875,
+      "step": 24
+    },
+    {
+      "completion_length": 82.0859375,
+      "epoch": 0.2,
+      "grad_norm": 0.32651785016059875,
+      "kl": 0.0033635327126830816,
+      "learning_rate": 1.6000000000000003e-05,
+      "loss": 0.0001,
+      "reward": -263.0546875,
+      "reward_std": 88.43621444702148,
+      "rewards/reward_len": -263.0546875,
+      "step": 25
+    },
+    {
+      "completion_length": 83.8125,
+      "epoch": 0.208,
+      "grad_norm": 0.3287697434425354,
+      "kl": 0.0036296172766014934,
+      "learning_rate": 1.584e-05,
+      "loss": 0.0001,
+      "reward": -263.1640625,
+      "reward_std": 85.08429336547852,
+      "rewards/reward_len": -263.1640625,
+      "step": 26
+    },
+    {
+      "completion_length": 86.546875,
+      "epoch": 0.216,
+      "grad_norm": 0.2324318289756775,
+      "kl": 0.003533878130838275,
+      "learning_rate": 1.5680000000000002e-05,
+      "loss": 0.0001,
+      "reward": -284.0,
+      "reward_std": 71.99230575561523,
+      "rewards/reward_len": -284.0,
+      "step": 27
+    },
+    {
+      "completion_length": 80.9765625,
+      "epoch": 0.224,
+      "grad_norm": 0.2990117073059082,
+      "kl": 0.003537152777425945,
+      "learning_rate": 1.552e-05,
+      "loss": 0.0001,
+      "reward": -262.2421875,
+      "reward_std": 84.41067504882812,
+      "rewards/reward_len": -262.2421875,
+      "step": 28
+    },
+    {
+      "completion_length": 80.828125,
+      "epoch": 0.232,
+      "grad_norm": 0.30853134393692017,
+      "kl": 0.003928918507881463,
+      "learning_rate": 1.5360000000000002e-05,
+      "loss": 0.0002,
+      "reward": -267.546875,
+      "reward_std": 80.8175163269043,
+      "rewards/reward_len": -267.546875,
+      "step": 29
+    },
+    {
+      "completion_length": 80.0703125,
+      "epoch": 0.24,
+      "grad_norm": 0.23498044908046722,
+      "kl": 0.004375259391963482,
+      "learning_rate": 1.5200000000000002e-05,
+      "loss": 0.0002,
+      "reward": -265.328125,
+      "reward_std": 86.1061897277832,
+      "rewards/reward_len": -265.328125,
+      "step": 30
+    },
+    {
+      "completion_length": 88.2734375,
+      "epoch": 0.248,
+      "grad_norm": 0.2088949978351593,
+      "kl": 0.004020389053039253,
+      "learning_rate": 1.5040000000000002e-05,
+      "loss": 0.0002,
+      "reward": -300.4375,
+      "reward_std": 56.174089431762695,
+      "rewards/reward_len": -300.4375,
+      "step": 31
+    },
+    {
+      "completion_length": 79.6328125,
+      "epoch": 0.256,
+      "grad_norm": 0.2648564279079437,
+      "kl": 0.0039774111937731504,
+      "learning_rate": 1.4880000000000002e-05,
+      "loss": 0.0002,
+      "reward": -247.53125,
+      "reward_std": 91.14960479736328,
+      "rewards/reward_len": -247.53125,
+      "step": 32
+    },
+    {
+      "completion_length": 84.4765625,
+      "epoch": 0.264,
+      "grad_norm": 0.2718077003955841,
+      "kl": 0.0045519310515373945,
+      "learning_rate": 1.4720000000000001e-05,
+      "loss": 0.0002,
+      "reward": -273.90625,
+      "reward_std": 84.63788223266602,
+      "rewards/reward_len": -273.90625,
+      "step": 33
+    },
+    {
+      "completion_length": 80.0078125,
+      "epoch": 0.272,
+      "grad_norm": 0.317321240901947,
+      "kl": 0.006291602039709687,
+      "learning_rate": 1.4560000000000001e-05,
+      "loss": 0.0003,
+      "reward": -258.078125,
+      "reward_std": 93.88013458251953,
+      "rewards/reward_len": -258.078125,
+      "step": 34
+    },
+    {
+      "completion_length": 81.375,
+      "epoch": 0.28,
+      "grad_norm": 0.33053135871887207,
+      "kl": 0.006775364512577653,
+      "learning_rate": 1.4400000000000001e-05,
+      "loss": 0.0003,
+      "reward": -258.75,
+      "reward_std": 89.74599838256836,
+      "rewards/reward_len": -258.75,
+      "step": 35
+    },
+    {
+      "completion_length": 80.1328125,
+      "epoch": 0.288,
+      "grad_norm": 0.2549481987953186,
+      "kl": 0.006435388699173927,
+      "learning_rate": 1.4240000000000001e-05,
+      "loss": 0.0003,
+      "reward": -255.7578125,
+      "reward_std": 84.49188613891602,
+      "rewards/reward_len": -255.7578125,
+      "step": 36
+    },
+    {
+      "completion_length": 84.015625,
+      "epoch": 0.296,
+      "grad_norm": 0.26275134086608887,
+      "kl": 0.007527944631874561,
+      "learning_rate": 1.408e-05,
+      "loss": 0.0003,
+      "reward": -271.5,
+      "reward_std": 81.36652374267578,
+      "rewards/reward_len": -271.5,
+      "step": 37
+    },
+    {
+      "completion_length": 84.4921875,
+      "epoch": 0.304,
+      "grad_norm": 0.23260033130645752,
+      "kl": 0.007381861563771963,
+      "learning_rate": 1.392e-05,
+      "loss": 0.0003,
+      "reward": -279.6015625,
+      "reward_std": 78.64627838134766,
+      "rewards/reward_len": -279.6015625,
+      "step": 38
+    },
+    {
+      "completion_length": 80.375,
+      "epoch": 0.312,
+      "grad_norm": 0.247163787484169,
+      "kl": 0.010356656275689602,
+      "learning_rate": 1.376e-05,
+      "loss": 0.0004,
+      "reward": -256.9140625,
+      "reward_std": 84.62659454345703,
+      "rewards/reward_len": -256.9140625,
+      "step": 39
+    },
+    {
+      "completion_length": 77.8828125,
+      "epoch": 0.32,
+      "grad_norm": 0.2849870026111603,
+      "kl": 0.009492217563092709,
+      "learning_rate": 1.3600000000000002e-05,
+      "loss": 0.0004,
+      "reward": -252.5,
+      "reward_std": 92.55280303955078,
+      "rewards/reward_len": -252.5,
+      "step": 40
+    },
+    {
+      "completion_length": 81.671875,
+      "epoch": 0.328,
+      "grad_norm": 0.23455281555652618,
+      "kl": 0.00837273895740509,
+      "learning_rate": 1.3440000000000002e-05,
+      "loss": 0.0003,
+      "reward": -261.28125,
+      "reward_std": 88.09553146362305,
+      "rewards/reward_len": -261.28125,
+      "step": 41
+    },
+    {
+      "completion_length": 80.703125,
+      "epoch": 0.336,
+      "grad_norm": 0.24879899621009827,
+      "kl": 0.009242677595466375,
+      "learning_rate": 1.3280000000000002e-05,
+      "loss": 0.0004,
+      "reward": -261.09375,
+      "reward_std": 85.93866729736328,
+      "rewards/reward_len": -261.09375,
+      "step": 42
+    },
+    {
+      "completion_length": 77.953125,
+      "epoch": 0.344,
+      "grad_norm": 0.27170729637145996,
+      "kl": 0.010577988345175982,
+      "learning_rate": 1.3120000000000001e-05,
+      "loss": 0.0004,
+      "reward": -256.359375,
+      "reward_std": 100.68645477294922,
+      "rewards/reward_len": -256.359375,
+      "step": 43
+    },
+    {
+      "completion_length": 82.5078125,
+      "epoch": 0.352,
+      "grad_norm": 0.33034706115722656,
+      "kl": 0.012457190081477165,
+      "learning_rate": 1.2960000000000001e-05,
+      "loss": 0.0005,
+      "reward": -271.359375,
+      "reward_std": 86.38698196411133,
+      "rewards/reward_len": -271.359375,
+      "step": 44
+    },
+    {
+      "completion_length": 78.9375,
+      "epoch": 0.36,
+      "grad_norm": 0.2691597640514374,
+      "kl": 0.01118185417726636,
+      "learning_rate": 1.2800000000000001e-05,
+      "loss": 0.0004,
+      "reward": -256.8515625,
+      "reward_std": 93.89360046386719,
+      "rewards/reward_len": -256.8515625,
+      "step": 45
+    },
+    {
+      "completion_length": 71.3984375,
+      "epoch": 0.368,
+      "grad_norm": 0.269377738237381,
+      "kl": 0.015074468217790127,
+      "learning_rate": 1.2640000000000001e-05,
+      "loss": 0.0006,
+      "reward": -232.1953125,
+      "reward_std": 99.15189361572266,
+      "rewards/reward_len": -232.1953125,
+      "step": 46
+    },
+    {
+      "completion_length": 80.34375,
+      "epoch": 0.376,
+      "grad_norm": 0.27561530470848083,
+      "kl": 0.013093658722937107,
+      "learning_rate": 1.248e-05,
+      "loss": 0.0005,
+      "reward": -259.2109375,
+      "reward_std": 88.95073699951172,
+      "rewards/reward_len": -259.2109375,
+      "step": 47
+    },
+    {
+      "completion_length": 80.0625,
+      "epoch": 0.384,
+      "grad_norm": 0.25536656379699707,
+      "kl": 0.012508484534919262,
+      "learning_rate": 1.232e-05,
+      "loss": 0.0005,
+      "reward": -260.2421875,
+      "reward_std": 86.21063995361328,
+      "rewards/reward_len": -260.2421875,
+      "step": 48
+    },
+    {
+      "completion_length": 76.8046875,
+      "epoch": 0.392,
+      "grad_norm": 0.27977946400642395,
+      "kl": 0.015685537364333868,
+      "learning_rate": 1.216e-05,
+      "loss": 0.0006,
+      "reward": -248.8046875,
+      "reward_std": 95.17082214355469,
+      "rewards/reward_len": -248.8046875,
+      "step": 49
+    },
+    {
+      "completion_length": 73.6171875,
+      "epoch": 0.4,
+      "grad_norm": 0.236115962266922,
+      "kl": 0.016457609832286835,
+      "learning_rate": 1.2e-05,
+      "loss": 0.0007,
+      "reward": -239.4765625,
+      "reward_std": 91.31077575683594,
+      "rewards/reward_len": -239.4765625,
+      "step": 50
+    },
+    {
+      "completion_length": 75.34375,
+      "epoch": 0.408,
+      "grad_norm": 0.23821134865283966,
+      "kl": 0.017242705449461937,
+      "learning_rate": 1.184e-05,
+      "loss": 0.0007,
+      "reward": -242.8984375,
+      "reward_std": 87.55542755126953,
+      "rewards/reward_len": -242.8984375,
+      "step": 51
+    },
+    {
+      "completion_length": 78.34375,
+      "epoch": 0.416,
+      "grad_norm": 0.2633512616157532,
+      "kl": 0.015819720923900604,
+      "learning_rate": 1.168e-05,
+      "loss": 0.0006,
+      "reward": -254.7265625,
+      "reward_std": 86.55766296386719,
+      "rewards/reward_len": -254.7265625,
+      "step": 52
+    },
+    {
+      "completion_length": 73.1875,
+      "epoch": 0.424,
+      "grad_norm": 0.30414703488349915,
+      "kl": 0.01967682968825102,
+      "learning_rate": 1.152e-05,
+      "loss": 0.0008,
+      "reward": -237.234375,
+      "reward_std": 94.06227493286133,
+      "rewards/reward_len": -237.234375,
+      "step": 53
+    },
+    {
+      "completion_length": 79.234375,
+      "epoch": 0.432,
+      "grad_norm": 0.25906994938850403,
+      "kl": 0.019587570801377296,
+      "learning_rate": 1.136e-05,
+      "loss": 0.0008,
+      "reward": -254.640625,
+      "reward_std": 84.00724792480469,
+      "rewards/reward_len": -254.640625,
+      "step": 54
+    },
+    {
+      "completion_length": 77.1328125,
+      "epoch": 0.44,
+      "grad_norm": 0.3377876877784729,
+      "kl": 0.023404529318213463,
+      "learning_rate": 1.1200000000000001e-05,
+      "loss": 0.0009,
+      "reward": -255.6796875,
+      "reward_std": 97.86257553100586,
+      "rewards/reward_len": -255.6796875,
+      "step": 55
+    },
+    {
+      "completion_length": 77.4375,
+      "epoch": 0.448,
+      "grad_norm": 0.28870245814323425,
+      "kl": 0.020200904458761215,
+      "learning_rate": 1.1040000000000001e-05,
+      "loss": 0.0008,
+      "reward": -244.7265625,
+      "reward_std": 94.0616226196289,
+      "rewards/reward_len": -244.7265625,
+      "step": 56
+    },
+    {
+      "completion_length": 72.6796875,
+      "epoch": 0.456,
+      "grad_norm": 0.39642176032066345,
+      "kl": 0.027743499726057053,
+      "learning_rate": 1.0880000000000001e-05,
+      "loss": 0.0011,
+      "reward": -230.0234375,
+      "reward_std": 95.595703125,
+      "rewards/reward_len": -230.0234375,
+      "step": 57
+    },
+    {
+      "completion_length": 66.8671875,
+      "epoch": 0.464,
+      "grad_norm": 0.2797935903072357,
+      "kl": 0.026664892211556435,
+      "learning_rate": 1.072e-05,
+      "loss": 0.0011,
+      "reward": -207.6171875,
+      "reward_std": 88.0827865600586,
+      "rewards/reward_len": -207.6171875,
+      "step": 58
+    },
+    {
+      "completion_length": 69.546875,
+      "epoch": 0.472,
+      "grad_norm": 0.3475685119628906,
+      "kl": 0.025908864103257656,
+      "learning_rate": 1.056e-05,
+      "loss": 0.001,
+      "reward": -222.6796875,
+      "reward_std": 104.85270690917969,
+      "rewards/reward_len": -222.6796875,
+      "step": 59
+    },
+    {
+      "completion_length": 72.265625,
+      "epoch": 0.48,
+      "grad_norm": 0.28196772933006287,
+      "kl": 0.028263960033655167,
+      "learning_rate": 1.04e-05,
+      "loss": 0.0011,
+      "reward": -235.6953125,
+      "reward_std": 102.04790496826172,
+      "rewards/reward_len": -235.6953125,
+      "step": 60
+    },
+    {
+      "completion_length": 73.3984375,
+      "epoch": 0.488,
+      "grad_norm": 0.2515757977962494,
+      "kl": 0.02572808228433132,
+      "learning_rate": 1.024e-05,
+      "loss": 0.001,
+      "reward": -231.375,
+      "reward_std": 103.71851348876953,
+      "rewards/reward_len": -231.375,
+      "step": 61
+    },
+    {
+      "completion_length": 70.1328125,
+      "epoch": 0.496,
+      "grad_norm": 0.2696300446987152,
+      "kl": 0.030540384352207184,
+      "learning_rate": 1.008e-05,
+      "loss": 0.0012,
+      "reward": -228.625,
+      "reward_std": 99.47586822509766,
+      "rewards/reward_len": -228.625,
+      "step": 62
+    },
+    {
+      "completion_length": 68.890625,
+      "epoch": 0.504,
+      "grad_norm": 0.3310464918613434,
+      "kl": 0.031847656704485416,
+      "learning_rate": 9.920000000000002e-06,
+      "loss": 0.0013,
+      "reward": -221.4140625,
+      "reward_std": 103.48392868041992,
+      "rewards/reward_len": -221.4140625,
+      "step": 63
+    },
+    {
+      "completion_length": 72.90625,
+      "epoch": 0.512,
+      "grad_norm": 0.28872978687286377,
+      "kl": 0.03188143577426672,
+      "learning_rate": 9.760000000000001e-06,
+      "loss": 0.0013,
+      "reward": -231.578125,
+      "reward_std": 102.58806991577148,
+      "rewards/reward_len": -231.578125,
+      "step": 64
+    },
+    {
+      "completion_length": 74.046875,
+      "epoch": 0.52,
+      "grad_norm": 0.2885415554046631,
+      "kl": 0.030631499364972115,
+      "learning_rate": 9.600000000000001e-06,
+      "loss": 0.0012,
+      "reward": -237.421875,
+      "reward_std": 102.28324127197266,
+      "rewards/reward_len": -237.421875,
+      "step": 65
+    },
+    {
+      "completion_length": 68.109375,
+      "epoch": 0.528,
+      "grad_norm": 0.4051050841808319,
+      "kl": 0.03556988015770912,
+      "learning_rate": 9.440000000000001e-06,
+      "loss": 0.0014,
+      "reward": -205.1328125,
+      "reward_std": 98.69538879394531,
+      "rewards/reward_len": -205.1328125,
+      "step": 66
+    },
+    {
+      "completion_length": 63.8125,
+      "epoch": 0.536,
+      "grad_norm": 0.2801409959793091,
+      "kl": 0.036647289991378784,
+      "learning_rate": 9.280000000000001e-06,
+      "loss": 0.0015,
+      "reward": -200.96875,
+      "reward_std": 95.41398239135742,
+      "rewards/reward_len": -200.96875,
+      "step": 67
+    },
+    {
+      "completion_length": 69.8046875,
+      "epoch": 0.544,
+      "grad_norm": 0.2989080846309662,
+      "kl": 0.04065835103392601,
+      "learning_rate": 9.12e-06,
+      "loss": 0.0016,
+      "reward": -223.609375,
+      "reward_std": 99.66047286987305,
+      "rewards/reward_len": -223.609375,
+      "step": 68
+    },
+    {
+      "completion_length": 61.4921875,
+      "epoch": 0.552,
+      "grad_norm": 0.3857433795928955,
+      "kl": 0.050694407895207405,
+      "learning_rate": 8.96e-06,
+      "loss": 0.002,
+      "reward": -188.671875,
+      "reward_std": 104.92461013793945,
+      "rewards/reward_len": -188.671875,
+      "step": 69
+    },
+    {
+      "completion_length": 70.625,
+      "epoch": 0.56,
+      "grad_norm": 0.34728458523750305,
+      "kl": 0.0426182746887207,
+      "learning_rate": 8.8e-06,
+      "loss": 0.0017,
+      "reward": -221.9296875,
+      "reward_std": 100.83985900878906,
+      "rewards/reward_len": -221.9296875,
+      "step": 70
+    },
+    {
+      "completion_length": 61.9453125,
+      "epoch": 0.568,
+      "grad_norm": 0.3515882194042206,
+      "kl": 0.0488403607159853,
+      "learning_rate": 8.64e-06,
+      "loss": 0.002,
+      "reward": -186.09375,
+      "reward_std": 95.39920043945312,
+      "rewards/reward_len": -186.09375,
+      "step": 71
+    },
+    {
+      "completion_length": 70.28125,
+      "epoch": 0.576,
+      "grad_norm": 0.3187553286552429,
+      "kl": 0.046066541224718094,
+      "learning_rate": 8.48e-06,
+      "loss": 0.0018,
+      "reward": -222.3984375,
+      "reward_std": 102.15713882446289,
+      "rewards/reward_len": -222.3984375,
+      "step": 72
+    },
+    {
+      "completion_length": 67.40625,
+      "epoch": 0.584,
+      "grad_norm": 0.29995688796043396,
+      "kl": 0.04663568735122681,
+      "learning_rate": 8.32e-06,
+      "loss": 0.0019,
+      "reward": -222.078125,
+      "reward_std": 102.06698226928711,
+      "rewards/reward_len": -222.078125,
+      "step": 73
+    },
+    {
+      "completion_length": 60.90625,
+      "epoch": 0.592,
+      "grad_norm": 0.3082631826400757,
+      "kl": 0.06111626885831356,
+      "learning_rate": 8.16e-06,
+      "loss": 0.0024,
+      "reward": -196.5859375,
+      "reward_std": 111.21645736694336,
+      "rewards/reward_len": -196.5859375,
+      "step": 74
+    },
+    {
+      "completion_length": 66.7890625,
+      "epoch": 0.6,
+      "grad_norm": 0.36775076389312744,
+      "kl": 0.059231631457805634,
+      "learning_rate": 8.000000000000001e-06,
+      "loss": 0.0024,
+      "reward": -216.078125,
+      "reward_std": 104.56696701049805,
+      "rewards/reward_len": -216.078125,
+      "step": 75
+    },
+    {
+      "completion_length": 62.28125,
+      "epoch": 0.608,
+      "grad_norm": 0.518227219581604,
+      "kl": 0.06296529620885849,
+      "learning_rate": 7.840000000000001e-06,
+      "loss": 0.0025,
+      "reward": -196.703125,
+      "reward_std": 112.63176727294922,
+      "rewards/reward_len": -196.703125,
+      "step": 76
+    },
+    {
+      "completion_length": 64.71875,
+      "epoch": 0.616,
+      "grad_norm": 0.3513058125972748,
+      "kl": 0.05728099122643471,
+      "learning_rate": 7.680000000000001e-06,
+      "loss": 0.0023,
+      "reward": -205.7265625,
+      "reward_std": 97.13314437866211,
+      "rewards/reward_len": -205.7265625,
+      "step": 77
+    },
+    {
+      "completion_length": 53.53125,
+      "epoch": 0.624,
+      "grad_norm": 0.3801332116127014,
+      "kl": 0.06445236876606941,
+      "learning_rate": 7.520000000000001e-06,
+      "loss": 0.0026,
+      "reward": -163.7578125,
+      "reward_std": 87.36725234985352,
+      "rewards/reward_len": -163.7578125,
+      "step": 78
+    },
+    {
+      "completion_length": 62.5625,
+      "epoch": 0.632,
+      "grad_norm": 0.3122998774051666,
+      "kl": 0.0733240656554699,
+      "learning_rate": 7.360000000000001e-06,
+      "loss": 0.0029,
+      "reward": -190.4375,
+      "reward_std": 96.96488571166992,
+      "rewards/reward_len": -190.4375,
+      "step": 79
+    },
+    {
+      "completion_length": 67.421875,
+      "epoch": 0.64,
+      "grad_norm": 0.36507463455200195,
+      "kl": 0.06272974610328674,
+      "learning_rate": 7.2000000000000005e-06,
+      "loss": 0.0025,
+      "reward": -210.1953125,
+      "reward_std": 98.55888366699219,
+      "rewards/reward_len": -210.1953125,
+      "step": 80
+    },
+    {
+      "completion_length": 65.5546875,
+      "epoch": 0.648,
+      "grad_norm": 0.37362825870513916,
+      "kl": 0.05581993982195854,
+      "learning_rate": 7.04e-06,
+      "loss": 0.0022,
+      "reward": -214.296875,
+      "reward_std": 99.79366683959961,
+      "rewards/reward_len": -214.296875,
+      "step": 81
+    },
+    {
+      "completion_length": 58.2578125,
+      "epoch": 0.656,
+      "grad_norm": 0.33545634150505066,
+      "kl": 0.07562102749943733,
+      "learning_rate": 6.88e-06,
+      "loss": 0.003,
+      "reward": -179.1953125,
+      "reward_std": 106.04055404663086,
+      "rewards/reward_len": -179.1953125,
+      "step": 82
+    },
+    {
+      "completion_length": 60.1875,
+      "epoch": 0.664,
+      "grad_norm": 0.302776962518692,
+      "kl": 0.06511162593960762,
+      "learning_rate": 6.720000000000001e-06,
+      "loss": 0.0026,
+      "reward": -196.5546875,
+      "reward_std": 100.0388069152832,
+      "rewards/reward_len": -196.5546875,
+      "step": 83
+    },
+    {
+      "completion_length": 60.984375,
+      "epoch": 0.672,
+      "grad_norm": 0.3094371259212494,
+      "kl": 0.06165306642651558,
+      "learning_rate": 6.560000000000001e-06,
+      "loss": 0.0025,
+      "reward": -191.25,
+      "reward_std": 102.2006721496582,
+      "rewards/reward_len": -191.25,
+      "step": 84
+    },
+    {
+      "completion_length": 58.515625,
+      "epoch": 0.68,
+      "grad_norm": 0.3535930812358856,
+      "kl": 0.07348541542887688,
+      "learning_rate": 6.4000000000000006e-06,
+      "loss": 0.0029,
+      "reward": -179.3671875,
+      "reward_std": 101.92927932739258,
+      "rewards/reward_len": -179.3671875,
+      "step": 85
+    },
+    {
+      "completion_length": 59.6875,
+      "epoch": 0.688,
+      "grad_norm": 0.3270580470561981,
+      "kl": 0.07816123962402344,
+      "learning_rate": 6.24e-06,
+      "loss": 0.0031,
+      "reward": -186.6875,
+      "reward_std": 108.60737228393555,
+      "rewards/reward_len": -186.6875,
+      "step": 86
+    },
+    {
+      "completion_length": 60.125,
+      "epoch": 0.696,
+      "grad_norm": 0.3442067503929138,
+      "kl": 0.07194574177265167,
+      "learning_rate": 6.08e-06,
+      "loss": 0.0029,
+      "reward": -194.4140625,
+      "reward_std": 109.5257797241211,
+      "rewards/reward_len": -194.4140625,
+      "step": 87
+    },
+    {
+      "completion_length": 62.1171875,
+      "epoch": 0.704,
+      "grad_norm": 0.3669171631336212,
+      "kl": 0.08246365189552307,
+      "learning_rate": 5.92e-06,
+      "loss": 0.0033,
+      "reward": -194.84375,
+      "reward_std": 102.23075485229492,
+      "rewards/reward_len": -194.84375,
+      "step": 88
+    },
+    {
+      "completion_length": 56.5390625,
+      "epoch": 0.712,
+      "grad_norm": 0.39455512166023254,
+      "kl": 0.08890286087989807,
+      "learning_rate": 5.76e-06,
+      "loss": 0.0036,
+      "reward": -174.9140625,
+      "reward_std": 103.08227157592773,
+      "rewards/reward_len": -174.9140625,
+      "step": 89
+    },
+    {
+      "completion_length": 58.4609375,
+      "epoch": 0.72,
+      "grad_norm": 0.311391681432724,
+      "kl": 0.08321782201528549,
+      "learning_rate": 5.600000000000001e-06,
+      "loss": 0.0033,
+      "reward": -176.375,
+      "reward_std": 110.65545272827148,
+      "rewards/reward_len": -176.375,
+      "step": 90
+    },
+    {
+      "completion_length": 59.9140625,
+      "epoch": 0.728,
+      "grad_norm": 0.32542526721954346,
+      "kl": 0.09862783551216125,
+      "learning_rate": 5.4400000000000004e-06,
+      "loss": 0.0039,
+      "reward": -187.46875,
+      "reward_std": 102.97476577758789,
+      "rewards/reward_len": -187.46875,
+      "step": 91
+    },
+    {
+      "completion_length": 56.9921875,
+      "epoch": 0.736,
+      "grad_norm": 0.29923832416534424,
+      "kl": 0.09448355808854103,
+      "learning_rate": 5.28e-06,
+      "loss": 0.0038,
+      "reward": -176.96875,
+      "reward_std": 105.89245223999023,
+      "rewards/reward_len": -176.96875,
+      "step": 92
+    },
+    {
+      "completion_length": 60.2734375,
+      "epoch": 0.744,
+      "grad_norm": 0.44803038239479065,
+      "kl": 0.0962735190987587,
+      "learning_rate": 5.12e-06,
+      "loss": 0.0039,
+      "reward": -186.1953125,
+      "reward_std": 112.5185661315918,
+      "rewards/reward_len": -186.1953125,
+      "step": 93
+    },
+    {
+      "completion_length": 54.4296875,
+      "epoch": 0.752,
+      "grad_norm": 0.3237845301628113,
+      "kl": 0.08328106999397278,
+      "learning_rate": 4.960000000000001e-06,
+      "loss": 0.0033,
+      "reward": -164.609375,
+      "reward_std": 109.59135437011719,
+      "rewards/reward_len": -164.609375,
+      "step": 94
+    },
+    {
+      "completion_length": 50.5234375,
+      "epoch": 0.76,
+      "grad_norm": 0.35100388526916504,
+      "kl": 0.11928272992372513,
+      "learning_rate": 4.800000000000001e-06,
+      "loss": 0.0048,
+      "reward": -147.328125,
+      "reward_std": 93.42684936523438,
+      "rewards/reward_len": -147.328125,
+      "step": 95
+    },
+    {
+      "completion_length": 54.9140625,
+      "epoch": 0.768,
+      "grad_norm": 0.3689323663711548,
+      "kl": 0.09899459034204483,
+      "learning_rate": 4.6400000000000005e-06,
+      "loss": 0.004,
+      "reward": -166.234375,
+      "reward_std": 100.38239288330078,
+      "rewards/reward_len": -166.234375,
+      "step": 96
+    },
+    {
+      "completion_length": 62.0859375,
+      "epoch": 0.776,
+      "grad_norm": 0.3250354528427124,
+      "kl": 0.08518016710877419,
+      "learning_rate": 4.48e-06,
+      "loss": 0.0034,
+      "reward": -193.390625,
+      "reward_std": 109.75457000732422,
+      "rewards/reward_len": -193.390625,
+      "step": 97
+    },
+    {
+      "completion_length": 47.359375,
+      "epoch": 0.784,
+      "grad_norm": 0.35672515630722046,
+      "kl": 0.10733343660831451,
+      "learning_rate": 4.32e-06,
+      "loss": 0.0043,
+      "reward": -138.9140625,
+      "reward_std": 98.22686004638672,
+      "rewards/reward_len": -138.9140625,
+      "step": 98
+    },
+    {
+      "completion_length": 54.953125,
+      "epoch": 0.792,
+      "grad_norm": 0.3013683259487152,
+      "kl": 0.08840527385473251,
+      "learning_rate": 4.16e-06,
+      "loss": 0.0035,
+      "reward": -169.4765625,
+      "reward_std": 100.61940383911133,
+      "rewards/reward_len": -169.4765625,
+      "step": 99
+    },
+    {
+      "completion_length": 56.1640625,
+      "epoch": 0.8,
+      "grad_norm": 0.2968733012676239,
+      "kl": 0.0861380472779274,
+      "learning_rate": 4.000000000000001e-06,
+      "loss": 0.0034,
+      "reward": -175.1171875,
+      "reward_std": 102.34061431884766,
+      "rewards/reward_len": -175.1171875,
+      "step": 100
+    },
+    {
+      "completion_length": 52.203125,
+      "epoch": 0.808,
+      "grad_norm": 0.2913699150085449,
+      "kl": 0.09863747283816338,
+      "learning_rate": 3.8400000000000005e-06,
+      "loss": 0.0039,
+      "reward": -163.171875,
+      "reward_std": 97.69365310668945,
+      "rewards/reward_len": -163.171875,
+      "step": 101
+    },
+    {
+      "completion_length": 53.4140625,
+      "epoch": 0.816,
+      "grad_norm": 0.3692280948162079,
+      "kl": 0.21122310310602188,
+      "learning_rate": 3.6800000000000003e-06,
+      "loss": 0.0084,
+      "reward": -164.4140625,
+      "reward_std": 112.10655212402344,
+      "rewards/reward_len": -164.4140625,
+      "step": 102
+    },
+    {
+      "completion_length": 55.4453125,
+      "epoch": 0.824,
+      "grad_norm": 0.3529679775238037,
+      "kl": 0.09558240696787834,
+      "learning_rate": 3.52e-06,
+      "loss": 0.0038,
+      "reward": -175.6328125,
+      "reward_std": 110.75713729858398,
+      "rewards/reward_len": -175.6328125,
+      "step": 103
+    },
+    {
+      "completion_length": 52.75,
+      "epoch": 0.832,
+      "grad_norm": 0.31306901574134827,
+      "kl": 0.12027820944786072,
+      "learning_rate": 3.3600000000000004e-06,
+      "loss": 0.0048,
+      "reward": -164.265625,
+      "reward_std": 101.10466384887695,
+      "rewards/reward_len": -164.265625,
+      "step": 104
+    },
+    {
+      "completion_length": 57.3046875,
+      "epoch": 0.84,
+      "grad_norm": 0.2741624116897583,
+      "kl": 0.10724575072526932,
+      "learning_rate": 3.2000000000000003e-06,
+      "loss": 0.0043,
+      "reward": -178.9140625,
+      "reward_std": 108.06570434570312,
+      "rewards/reward_len": -178.9140625,
+      "step": 105
+    },
+    {
+      "completion_length": 56.625,
+      "epoch": 0.848,
+      "grad_norm": 0.3823026716709137,
+      "kl": 0.10432938858866692,
+      "learning_rate": 3.04e-06,
+      "loss": 0.0042,
+      "reward": -180.0625,
+      "reward_std": 106.51268768310547,
+      "rewards/reward_len": -180.0625,
+      "step": 106
+    },
+    {
+      "completion_length": 54.15625,
+      "epoch": 0.856,
+      "grad_norm": 0.3364756107330322,
+      "kl": 0.10648486390709877,
+      "learning_rate": 2.88e-06,
+      "loss": 0.0043,
+      "reward": -169.109375,
+      "reward_std": 104.56985855102539,
+      "rewards/reward_len": -169.109375,
+      "step": 107
+    },
+    {
+      "completion_length": 55.09375,
+      "epoch": 0.864,
+      "grad_norm": 0.41603678464889526,
+      "kl": 0.11112185195088387,
+      "learning_rate": 2.7200000000000002e-06,
+      "loss": 0.0044,
+      "reward": -167.2265625,
+      "reward_std": 109.89325714111328,
+      "rewards/reward_len": -167.2265625,
+      "step": 108
+    },
+    {
+      "completion_length": 49.546875,
+      "epoch": 0.872,
+      "grad_norm": 0.4080374538898468,
+      "kl": 0.12799223512411118,
+      "learning_rate": 2.56e-06,
+      "loss": 0.0051,
+      "reward": -146.6875,
+      "reward_std": 108.09561538696289,
+      "rewards/reward_len": -146.6875,
+      "step": 109
+    },
+    {
+      "completion_length": 53.6640625,
+      "epoch": 0.88,
+      "grad_norm": 0.33172112703323364,
+      "kl": 0.10977593064308167,
+      "learning_rate": 2.4000000000000003e-06,
+      "loss": 0.0044,
+      "reward": -162.4609375,
+      "reward_std": 104.02449035644531,
+      "rewards/reward_len": -162.4609375,
+      "step": 110
+    },
+    {
+      "completion_length": 52.8046875,
+      "epoch": 0.888,
+      "grad_norm": 0.2904779613018036,
+      "kl": 0.10688870772719383,
+      "learning_rate": 2.24e-06,
+      "loss": 0.0043,
+      "reward": -161.2265625,
+      "reward_std": 102.53210067749023,
+      "rewards/reward_len": -161.2265625,
+      "step": 111
+    },
+    {
+      "completion_length": 55.9609375,
+      "epoch": 0.896,
+      "grad_norm": 0.3141164481639862,
+      "kl": 0.10344751551747322,
+      "learning_rate": 2.08e-06,
+      "loss": 0.0041,
+      "reward": -174.4921875,
+      "reward_std": 110.17209243774414,
+      "rewards/reward_len": -174.4921875,
+      "step": 112
+    },
+    {
+      "completion_length": 54.0859375,
+      "epoch": 0.904,
+      "grad_norm": 0.3483990430831909,
+      "kl": 0.10751833021640778,
+      "learning_rate": 1.9200000000000003e-06,
+      "loss": 0.0043,
+      "reward": -163.1171875,
+      "reward_std": 100.39259338378906,
+      "rewards/reward_len": -163.1171875,
+      "step": 113
+    },
+    {
+      "completion_length": 46.0859375,
+      "epoch": 0.912,
+      "grad_norm": 0.3864634931087494,
+      "kl": 0.14491157233715057,
+      "learning_rate": 1.76e-06,
+      "loss": 0.0058,
+      "reward": -133.6171875,
+      "reward_std": 100.36182403564453,
+      "rewards/reward_len": -133.6171875,
+      "step": 114
+    },
+    {
+      "completion_length": 52.59375,
+      "epoch": 0.92,
+      "grad_norm": 0.38450247049331665,
+      "kl": 0.12875013053417206,
+      "learning_rate": 1.6000000000000001e-06,
+      "loss": 0.0052,
+      "reward": -168.3984375,
+      "reward_std": 95.37660217285156,
+      "rewards/reward_len": -168.3984375,
+      "step": 115
+    },
+    {
+      "completion_length": 52.6015625,
+      "epoch": 0.928,
+      "grad_norm": 0.32588890194892883,
+      "kl": 0.09462021291255951,
+      "learning_rate": 1.44e-06,
+      "loss": 0.0038,
+      "reward": -162.84375,
+      "reward_std": 89.71159362792969,
+      "rewards/reward_len": -162.84375,
+      "step": 116
+    },
+    {
+      "completion_length": 51.859375,
+      "epoch": 0.936,
+      "grad_norm": 0.33285778760910034,
+      "kl": 0.13061665371060371,
+      "learning_rate": 1.28e-06,
+      "loss": 0.0052,
+      "reward": -155.2265625,
+      "reward_std": 94.24004364013672,
+      "rewards/reward_len": -155.2265625,
+      "step": 117
+    },
+    {
+      "completion_length": 50.5625,
+      "epoch": 0.944,
+      "grad_norm": 0.36145925521850586,
+      "kl": 0.1183355301618576,
+      "learning_rate": 1.12e-06,
+      "loss": 0.0047,
+      "reward": -153.34375,
+      "reward_std": 97.1226806640625,
+      "rewards/reward_len": -153.34375,
+      "step": 118
+    },
+    {
+      "completion_length": 47.515625,
+      "epoch": 0.952,
+      "grad_norm": 0.29093804955482483,
+      "kl": 0.13388825953006744,
+      "learning_rate": 9.600000000000001e-07,
+      "loss": 0.0054,
+      "reward": -143.7109375,
+      "reward_std": 109.23156356811523,
+      "rewards/reward_len": -143.7109375,
+      "step": 119
+    },
+    {
+      "completion_length": 48.3359375,
+      "epoch": 0.96,
+      "grad_norm": 0.33335262537002563,
+      "kl": 0.11851062625646591,
+      "learning_rate": 8.000000000000001e-07,
+      "loss": 0.0047,
+      "reward": -154.125,
+      "reward_std": 98.02896499633789,
+      "rewards/reward_len": -154.125,
+      "step": 120
+    },
+    {
+      "completion_length": 51.5078125,
+      "epoch": 0.968,
+      "grad_norm": 0.3453635573387146,
+      "kl": 0.10678033903241158,
+      "learning_rate": 6.4e-07,
+      "loss": 0.0043,
+      "reward": -152.234375,
+      "reward_std": 99.61367797851562,
+      "rewards/reward_len": -152.234375,
+      "step": 121
+    },
+    {
+      "completion_length": 52.7578125,
+      "epoch": 0.976,
+      "grad_norm": 0.2886435389518738,
+      "kl": 0.11124991998076439,
+      "learning_rate": 4.800000000000001e-07,
+      "loss": 0.0044,
+      "reward": -154.7265625,
+      "reward_std": 103.94355392456055,
+      "rewards/reward_len": -154.7265625,
+      "step": 122
+    },
+    {
+      "completion_length": 48.890625,
+      "epoch": 0.984,
+      "grad_norm": 0.3157053291797638,
+      "kl": 0.1317698396742344,
+      "learning_rate": 3.2e-07,
+      "loss": 0.0053,
+      "reward": -145.2890625,
+      "reward_std": 103.36433029174805,
+      "rewards/reward_len": -145.2890625,
+      "step": 123
+    },
+    {
+      "completion_length": 52.0546875,
+      "epoch": 0.992,
+      "grad_norm": 0.29187846183776855,
+      "kl": 0.1326664201915264,
+      "learning_rate": 1.6e-07,
+      "loss": 0.0053,
+      "reward": -151.4375,
+      "reward_std": 94.39168548583984,
+      "rewards/reward_len": -151.4375,
+      "step": 124
+    },
+    {
+      "completion_length": 54.9453125,
+      "epoch": 1.0,
+      "grad_norm": 0.28168249130249023,
+      "kl": 0.1043478213250637,
+      "learning_rate": 0.0,
+      "loss": 0.0042,
+      "reward": -175.09375,
+      "reward_std": 90.59318923950195,
+      "rewards/reward_len": -175.09375,
+      "step": 125
+    }
+  ],
+  "logging_steps": 1,
+  "max_steps": 125,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 1,
+  "save_steps": 500,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": true
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 0.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}
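
Note on the logged schedule: the `learning_rate` and `epoch` values above are consistent with a linear decay to zero over `max_steps` = 125. The sketch below checks that arithmetic against three entries copied from the log. The base rate of 2e-5 and the absence of warmup are assumptions inferred from the logged numbers, not values stated in this file. The helper names `linear_lr` and `epoch_at` are hypothetical.

```python
import math

# Assumed values: BASE_LR is inferred from the logged learning rates and is
# not stated anywhere in the state file; no warmup is assumed.
BASE_LR = 2e-5
MAX_STEPS = 125   # "max_steps" in the state file
NUM_EPOCHS = 1    # "num_train_epochs" in the state file


def linear_lr(step, base_lr=BASE_LR, max_steps=MAX_STEPS):
    """Learning rate at `step` under a linear decay from base_lr to zero."""
    return base_lr * (max_steps - step) / max_steps


def epoch_at(step, max_steps=MAX_STEPS, num_epochs=NUM_EPOCHS):
    """Fractional epoch reached after `step` optimizer steps."""
    return num_epochs * step / max_steps


# (step, learning_rate, epoch) triples copied from the log entries above.
logged = [
    (53, 1.152e-05, 0.424),
    (100, 4.000000000000001e-06, 0.8),
    (125, 0.0, 1.0),
]

for step, lr, epoch in logged:
    # Tolerances absorb the floating-point noise visible in the log.
    assert math.isclose(linear_lr(step), lr, rel_tol=1e-9, abs_tol=1e-15)
    assert math.isclose(epoch_at(step), epoch, rel_tol=1e-9)
```

If the assumed base rate or schedule were different, the assertions would fail on the first entry, so the loop doubles as a quick sanity check when comparing this log with the training configuration.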

training_args.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0fbcfe22a8e25b620462df438341bc2a92d276d4612ca55838b7c0f86fbb66fe
+size 5905

vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff