Upload folder using huggingface_hub

Files changed (14)

README.md +10 -2
checkpoint-98/README.md +202 -0
checkpoint-98/adapter_config.json +37 -0
checkpoint-98/adapter_model.safetensors +3 -0
checkpoint-98/merges.txt +0 -0
checkpoint-98/optimizer.pt +3 -0
checkpoint-98/rng_state.pth +3 -0
checkpoint-98/scheduler.pt +3 -0
checkpoint-98/special_tokens_map.json +28 -0
checkpoint-98/tokenizer.json +0 -0
checkpoint-98/tokenizer_config.json +156 -0
checkpoint-98/trainer_state.json +1209 -0
checkpoint-98/training_args.bin +3 -0
checkpoint-98/vocab.json +0 -0

README.md CHANGED

@@ -12,7 +12,7 @@ licence: license
 # Model Card for SmolLM2-360M-GRPO-v0
 This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M).
-It has been trained using [TRL](https://github.com/huggingface/trl).
+It has been finetuned using [TRL](https://github.com/huggingface/trl).
 ## Quick start
@@ -30,7 +30,15 @@ print(output["generated_text"])
 [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/aathatte2002-indian-institute-of-technology/SmolLM-135M-finetune/runs/szfjiiio)
-This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300) and using the `lamini/taylor_swift` dataset.
+## Evals
+Referring this [blog post](https://datawizz.ai/blog/grpo-fine-tuning-qwen-0-5b-vs-openai-o1-preview), used a similar evaluation method:
+| Model | Average ROUGE-L |
+|-------|-----------------|
+| Qwen-0.5B | 0.3313 |
+| SmolLM2-360M-GRPO-v0 | 0.1644 |
 ### Framework versions
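
The Evals table in this change reports an average ROUGE-L score, but the commit does not include the evaluation script. The following is only a minimal sketch of how such an average could be computed, assuming lowercased whitespace tokenization and the LCS-based F1 variant of ROUGE-L. The function names `lcs_length`, `rouge_l_f1`, and `average_rouge_l` are hypothetical, and the evaluation actually used may differ (for example, by relying on a library that applies stemming).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_rouge_l(predictions, references):
    """Mean ROUGE-L F1 over paired predictions and references."""
    scores = [rouge_l_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Calling `average_rouge_l(model_outputs, gold_answers)` on paired lists of strings yields a single number comparable in kind to the table entries, though not necessarily in value, since tokenization choices change the score.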

checkpoint-98/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: HuggingFaceTB/SmolLM2-360M
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.14.0

checkpoint-98/adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "q_proj",
+    "v_proj",
+    "o_proj",
+    "down_proj",
+    "gate_proj",
+    "up_proj",
+    "k_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
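
As a reading aid for the adapter configuration above: with `use_rslora` set to `false`, LoRA scales the low-rank update by `lora_alpha / r`, which here is 64 / 32. The sketch below computes that factor and estimates the number of trainable adapter parameters across the seven `target_modules`. The layer dimensions (hidden size 960, intermediate size 2560, key/value projection width 320, 32 layers) are assumptions about the SmolLM2-360M base model and do not appear anywhere in this commit; the helper names are hypothetical.

```python
# Values copied from adapter_config.json in this commit.
LORA_R = 32
LORA_ALPHA = 64
TARGET_MODULES = [
    "q_proj", "v_proj", "o_proj", "down_proj", "gate_proj", "up_proj", "k_proj",
]

# ASSUMED base-model dimensions (not stated in the commit).
HIDDEN = 960
INTERMEDIATE = 2560
KV_DIM = 320
NUM_LAYERS = 32

# (in_features, out_features) per targeted linear layer, under the assumptions above.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_scaling(alpha, r):
    """Standard (non-rsLoRA) scaling applied to the low-rank update."""
    return alpha / r


def lora_param_count(r, shapes, modules, num_layers):
    """Each adapted layer adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in modules)
    return per_layer * num_layers


scaling = lora_scaling(LORA_ALPHA, LORA_R)
n_params = lora_param_count(LORA_R, MODULE_SHAPES, TARGET_MODULES, NUM_LAYERS)
print(f"scaling={scaling}, adapter parameters={n_params:,}")
```

Under those assumed dimensions the estimate is 17,367,040 parameters, or about 69.5 MB if stored in float32, which is in line with the 69,527,352-byte size recorded in the `adapter_model.safetensors` pointer.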

checkpoint-98/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5a8a73287bc73eaf9dc00caaaf4e47698e1b8d9c0f806f7020173f17c6872e0c
+size 69527352
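
The large binary files in this commit are stored as Git LFS pointers: three-line text files that record the spec version, a SHA-256 object ID, and the size in bytes. As a minimal sketch (the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical), downloaded bytes can be checked against a pointer like this:

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )
```

For multi-gigabyte files the digest would normally be computed in chunks rather than from one in-memory `bytes` object; the sketch keeps everything in memory for brevity.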

checkpoint-98/merges.txt ADDED

The diff for this file is too large to render. See raw diff

checkpoint-98/optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:afb11a746bc09322854bc435cd229c91092bca5fb0a0144d75f74ff54a645208
+size 139313234

checkpoint-98/rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6870e6b994954e2a4b66879b85dd674222b4f12959ddba4fd1717b1bc294b4fa
+size 14244

checkpoint-98/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d7fd29e5eb4f01a063a0d33b5b87cade51a37a47d8a87948817fa436b1777af9
+size 1064

checkpoint-98/special_tokens_map.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": "<|im_start|>",
+  "eos_token": "<|im_end|>",
+  "pad_token": "<|im_end|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

checkpoint-98/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

checkpoint-98/tokenizer_config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": "<|im_start|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<|im_end|>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
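
The `chat_template` in the tokenizer configuration above is a ChatML-style Jinja template. To illustrate what it produces, here is a plain-Python sketch that mirrors its logic step by step. `render_chat` is a hypothetical helper written for this note; in practice the tokenizer's own `apply_chat_template` method would render the template.

```python
def render_chat(messages, add_generation_prompt=False):
    """Mirror the chat_template: wrap each message in <|im_start|> ... <|im_end|>."""
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
print(prompt)
```

Because `eos_token` and `pad_token` are both `<|im_end|>`, generation stops at the end of the assistant turn that this prompt opens.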

checkpoint-98/trainer_state.json ADDED

	@@ -0,0 +1,1209 @@

+{
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 1.0,
+  "eval_steps": 500,
+  "global_step": 98,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "completion_length": 160.0,
+      "epoch": 0.01020408163265306,
+      "grad_norm": 0.16015149652957916,
+      "kl": 0.0,
+      "learning_rate": 1.0000000000000002e-06,
+      "loss": 0.0,
+      "reward": 0.13549137115478516,
+      "reward_std": 0.122966468334198,
+      "rewards/<lambda>": 0.13549137115478516,
+      "step": 1
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.02040816326530612,
+      "grad_norm": 0.1122620478272438,
+      "kl": 0.0,
+      "learning_rate": 2.0000000000000003e-06,
+      "loss": 0.0,
+      "reward": 0.1350748986005783,
+      "reward_std": 0.13239088654518127,
+      "rewards/<lambda>": 0.1350748986005783,
+      "step": 2
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.030612244897959183,
+      "grad_norm": 0.10055773705244064,
+      "kl": 0.0009823411237448454,
+      "learning_rate": 3e-06,
+      "loss": 0.0,
+      "reward": 0.11044053733348846,
+      "reward_std": 0.09149592369794846,
+      "rewards/<lambda>": 0.11044053733348846,
+      "step": 3
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.04081632653061224,
+      "grad_norm": 0.1106867864727974,
+      "kl": 0.0011138569097965956,
+      "learning_rate": 4.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.14424282312393188,
+      "reward_std": 0.08860041201114655,
+      "rewards/<lambda>": 0.14424282312393188,
+      "step": 4
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.05102040816326531,
+      "grad_norm": 0.10906749963760376,
+      "kl": 0.0009323786944150925,
+      "learning_rate": 5e-06,
+      "loss": 0.0,
+      "reward": 0.15370804071426392,
+      "reward_std": 0.12023884057998657,
+      "rewards/<lambda>": 0.15370804071426392,
+      "step": 5
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.061224489795918366,
+      "grad_norm": 0.0973278358578682,
+      "kl": 0.000986273866146803,
+      "learning_rate": 6e-06,
+      "loss": 0.0,
+      "reward": 0.16775861382484436,
+      "reward_std": 0.11324145644903183,
+      "rewards/<lambda>": 0.16775861382484436,
+      "step": 6
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.07142857142857142,
+      "grad_norm": 0.11650331318378448,
+      "kl": 0.0009559993632137775,
+      "learning_rate": 7.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.12161540985107422,
+      "reward_std": 0.09721754491329193,
+      "rewards/<lambda>": 0.12161540985107422,
+      "step": 7
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.08163265306122448,
+      "grad_norm": 0.14426998794078827,
+      "kl": 0.001172696123830974,
+      "learning_rate": 8.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.10382680594921112,
+      "reward_std": 0.08653440326452255,
+      "rewards/<lambda>": 0.10382680594921112,
+      "step": 8
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.09183673469387756,
+      "grad_norm": 0.09224210679531097,
+      "kl": 0.0009267181158065796,
+      "learning_rate": 9e-06,
+      "loss": 0.0,
+      "reward": 0.15248040854930878,
+      "reward_std": 0.10083242505788803,
+      "rewards/<lambda>": 0.15248040854930878,
+      "step": 9
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.10204081632653061,
+      "grad_norm": 0.10662151873111725,
+      "kl": 0.001049146056175232,
+      "learning_rate": 1e-05,
+      "loss": 0.0,
+      "reward": 0.10904872417449951,
+      "reward_std": 0.07905881851911545,
+      "rewards/<lambda>": 0.10904872417449951,
+      "step": 10
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.11224489795918367,
+      "grad_norm": 0.08880773186683655,
+      "kl": 0.0009575427393428981,
+      "learning_rate": 1.1000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.1023264229297638,
+      "reward_std": 0.09154930710792542,
+      "rewards/<lambda>": 0.1023264229297638,
+      "step": 11
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.12244897959183673,
+      "grad_norm": 0.1399761289358139,
+      "kl": 0.0010016188025474548,
+      "learning_rate": 1.2e-05,
+      "loss": 0.0,
+      "reward": 0.12774227559566498,
+      "reward_std": 0.10531343519687653,
+      "rewards/<lambda>": 0.12774227559566498,
+      "step": 12
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1326530612244898,
+      "grad_norm": 0.10604196786880493,
+      "kl": 0.0010334912221878767,
+      "learning_rate": 1.3000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.13598087430000305,
+      "reward_std": 0.12228463590145111,
+      "rewards/<lambda>": 0.13598087430000305,
+      "step": 13
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.14285714285714285,
+      "grad_norm": 0.1302865892648697,
+      "kl": 0.0010531357256695628,
+      "learning_rate": 1.4000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.09349755942821503,
+      "reward_std": 0.07072040438652039,
+      "rewards/<lambda>": 0.09349755942821503,
+      "step": 14
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.15306122448979592,
+      "grad_norm": 0.12415535002946854,
+      "kl": 0.0009964585769921541,
+      "learning_rate": 1.5e-05,
+      "loss": 0.0,
+      "reward": 0.11064597219228745,
+      "reward_std": 0.08005572855472565,
+      "rewards/<lambda>": 0.11064597219228745,
+      "step": 15
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.16326530612244897,
+      "grad_norm": 0.11816220730543137,
+      "kl": 0.0009764357237145305,
+      "learning_rate": 1.6000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.1383560299873352,
+      "reward_std": 0.09371937811374664,
+      "rewards/<lambda>": 0.1383560299873352,
+      "step": 16
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.17346938775510204,
+      "grad_norm": 0.10982600599527359,
+      "kl": 0.0009334891801699996,
+      "learning_rate": 1.7000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.12855608761310577,
+      "reward_std": 0.11154236644506454,
+      "rewards/<lambda>": 0.12855608761310577,
+      "step": 17
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1836734693877551,
+      "grad_norm": 0.09776873886585236,
+      "kl": 0.0008928977185860276,
+      "learning_rate": 1.8e-05,
+      "loss": 0.0,
+      "reward": 0.15613257884979248,
+      "reward_std": 0.13504436612129211,
+      "rewards/<lambda>": 0.15613257884979248,
+      "step": 18
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.19387755102040816,
+      "grad_norm": 0.12221962213516235,
+      "kl": 0.0010021468624472618,
+      "learning_rate": 1.9e-05,
+      "loss": 0.0,
+      "reward": 0.15936216711997986,
+      "reward_std": 0.09952296316623688,
+      "rewards/<lambda>": 0.15936216711997986,
+      "step": 19
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.20408163265306123,
+      "grad_norm": 0.09339983761310577,
+      "kl": 0.0011222452158108354,
+      "learning_rate": 2e-05,
+      "loss": 0.0,
+      "reward": 0.09720780700445175,
+      "reward_std": 0.08142109215259552,
+      "rewards/<lambda>": 0.09720780700445175,
+      "step": 20
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.21428571428571427,
+      "grad_norm": 0.11736761033535004,
+      "kl": 0.0011920102406293154,
+      "learning_rate": 2.1e-05,
+      "loss": 0.0,
+      "reward": 0.15125760436058044,
+      "reward_std": 0.10276186466217041,
+      "rewards/<lambda>": 0.15125760436058044,
+      "step": 21
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.22448979591836735,
+      "grad_norm": 0.13611504435539246,
+      "kl": 0.00114919594489038,
+      "learning_rate": 2.2000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.1826939433813095,
+      "reward_std": 0.12737450003623962,
+      "rewards/<lambda>": 0.1826939433813095,
+      "step": 22
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.23469387755102042,
+      "grad_norm": 0.12078937888145447,
+      "kl": 0.001010789768770337,
+      "learning_rate": 2.3000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.15208037197589874,
+      "reward_std": 0.13148412108421326,
+      "rewards/<lambda>": 0.15208037197589874,
+      "step": 23
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.24489795918367346,
+      "grad_norm": 0.14114625751972198,
+      "kl": 0.0011141165159642696,
+      "learning_rate": 2.4e-05,
+      "loss": 0.0,
+      "reward": 0.13319575786590576,
+      "reward_std": 0.10379491746425629,
+      "rewards/<lambda>": 0.13319575786590576,
+      "step": 24
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.25510204081632654,
+      "grad_norm": 0.08827733993530273,
+      "kl": 0.0010151572059839964,
+      "learning_rate": 2.5e-05,
+      "loss": 0.0,
+      "reward": 0.1206921935081482,
+      "reward_std": 0.11820320785045624,
+      "rewards/<lambda>": 0.1206921935081482,
+      "step": 25
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2653061224489796,
+      "grad_norm": 0.10461141169071198,
+      "kl": 0.0010629891185089946,
+      "learning_rate": 2.6000000000000002e-05,
+      "loss": 0.0,
+      "reward": 0.12660565972328186,
+      "reward_std": 0.09369075298309326,
+      "rewards/<lambda>": 0.12660565972328186,
+      "step": 26
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2755102040816326,
+      "grad_norm": 0.12817198038101196,
+      "kl": 0.0014134375378489494,
+      "learning_rate": 2.7000000000000002e-05,
+      "loss": 0.0001,
+      "reward": 0.15190695226192474,
+      "reward_std": 0.1361895054578781,
+      "rewards/<lambda>": 0.15190695226192474,
+      "step": 27
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2857142857142857,
+      "grad_norm": 0.14398054778575897,
+      "kl": 0.0014005769044160843,
+      "learning_rate": 2.8000000000000003e-05,
+      "loss": 0.0001,
+      "reward": 0.11002078652381897,
+      "reward_std": 0.08406772464513779,
+      "rewards/<lambda>": 0.11002078652381897,
+      "step": 28
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.29591836734693877,
+      "grad_norm": 0.12125135213136673,
+      "kl": 0.001406812109053135,
+      "learning_rate": 2.9e-05,
+      "loss": 0.0001,
+      "reward": 0.1488090455532074,
+      "reward_std": 0.09648909419775009,
+      "rewards/<lambda>": 0.1488090455532074,
+      "step": 29
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.30612244897959184,
+      "grad_norm": 0.1116093099117279,
+      "kl": 0.001532193971797824,
+      "learning_rate": 3e-05,
+      "loss": 0.0001,
+      "reward": 0.1401943415403366,
+      "reward_std": 0.10262932628393173,
+      "rewards/<lambda>": 0.1401943415403366,
+      "step": 30
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3163265306122449,
+      "grad_norm": 0.11590556800365448,
+      "kl": 0.0013668447500094771,
+      "learning_rate": 3.1e-05,
+      "loss": 0.0001,
+      "reward": 0.15992209315299988,
+      "reward_std": 0.13228771090507507,
+      "rewards/<lambda>": 0.15992209315299988,
+      "step": 31
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.32653061224489793,
+      "grad_norm": 0.22361916303634644,
+      "kl": 0.0015757133951410651,
+      "learning_rate": 3.2000000000000005e-05,
+      "loss": 0.0001,
+      "reward": 0.14634232223033905,
+      "reward_std": 0.0937654972076416,
+      "rewards/<lambda>": 0.14634232223033905,
+      "step": 32
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.336734693877551,
+      "grad_norm": 0.11585766822099686,
+      "kl": 0.0013871926348656416,
+      "learning_rate": 3.3e-05,
+      "loss": 0.0001,
+      "reward": 0.1371658891439438,
+      "reward_std": 0.0987061858177185,
+      "rewards/<lambda>": 0.1371658891439438,
+      "step": 33
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3469387755102041,
+      "grad_norm": 0.12178175151348114,
+      "kl": 0.00142570654861629,
+      "learning_rate": 3.4000000000000007e-05,
+      "loss": 0.0001,
+      "reward": 0.12351685017347336,
+      "reward_std": 0.11602398008108139,
+      "rewards/<lambda>": 0.12351685017347336,
+      "step": 34
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.35714285714285715,
+      "grad_norm": 0.0959494486451149,
+      "kl": 0.0016183136031031609,
+      "learning_rate": 3.5e-05,
+      "loss": 0.0001,
+      "reward": 0.08838998526334763,
+      "reward_std": 0.08110953867435455,
+      "rewards/<lambda>": 0.08838998526334763,
+      "step": 35
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3673469387755102,
+      "grad_norm": 0.10338612645864487,
+      "kl": 0.0017382192891091108,
+      "learning_rate": 3.6e-05,
+      "loss": 0.0001,
+      "reward": 0.16435359418392181,
+      "reward_std": 0.13644874095916748,
+      "rewards/<lambda>": 0.16435359418392181,
+      "step": 36
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.37755102040816324,
+      "grad_norm": 0.08886157721281052,
+      "kl": 0.00104477908462286,
+      "learning_rate": 3.7e-05,
+      "loss": 0.0,
+      "reward": 0.12062600255012512,
+      "reward_std": 0.10176925361156464,
+      "rewards/<lambda>": 0.12062600255012512,
+      "step": 37
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3877551020408163,
+      "grad_norm": 0.16644084453582764,
+      "kl": 0.001502887113019824,
+      "learning_rate": 3.8e-05,
+      "loss": 0.0001,
+      "reward": 0.12538418173789978,
+      "reward_std": 0.11342241615056992,
+      "rewards/<lambda>": 0.12538418173789978,
+      "step": 38
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3979591836734694,
+      "grad_norm": 0.10507523268461227,
+      "kl": 0.0016810973174870014,
+      "learning_rate": 3.9000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.1384812593460083,
+      "reward_std": 0.10060354322195053,
+      "rewards/<lambda>": 0.1384812593460083,
+      "step": 39
+    },
+    {
+      "completion_length": 156.89584350585938,
+      "epoch": 0.40816326530612246,
+      "grad_norm": 0.13816726207733154,
+      "kl": 0.002180024515837431,
+      "learning_rate": 4e-05,
+      "loss": 0.0001,
+      "reward": 0.1627415418624878,
+      "reward_std": 0.09559802711009979,
+      "rewards/<lambda>": 0.1627415418624878,
+      "step": 40
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.41836734693877553,
+      "grad_norm": 0.09595762193202972,
+      "kl": 0.001595992362126708,
+      "learning_rate": 4.1e-05,
+      "loss": 0.0001,
+      "reward": 0.1617720127105713,
+      "reward_std": 0.09887909144163132,
+      "rewards/<lambda>": 0.1617720127105713,
+      "step": 41
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.42857142857142855,
+      "grad_norm": 0.19110041856765747,
+      "kl": 0.002547662705183029,
+      "learning_rate": 4.2e-05,
+      "loss": 0.0001,
+      "reward": 0.19862735271453857,
+      "reward_std": 0.11392591893672943,
+      "rewards/<lambda>": 0.19862735271453857,
+      "step": 42
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4387755102040816,
+      "grad_norm": 0.09072164446115494,
+      "kl": 0.0016248103929683566,
+      "learning_rate": 4.3e-05,
+      "loss": 0.0001,
+      "reward": 0.13210207223892212,
+      "reward_std": 0.09667991101741791,
+      "rewards/<lambda>": 0.13210207223892212,
+      "step": 43
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4489795918367347,
+      "grad_norm": 0.10356740653514862,
+      "kl": 0.002113830065354705,
+      "learning_rate": 4.4000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.14277184009552002,
+      "reward_std": 0.09977956861257553,
+      "rewards/<lambda>": 0.14277184009552002,
+      "step": 44
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.45918367346938777,
+      "grad_norm": 0.1663713902235031,
+      "kl": 0.0063522434793412685,
+      "learning_rate": 4.5e-05,
+      "loss": 0.0003,
+      "reward": 0.16242042183876038,
+      "reward_std": 0.12298040091991425,
+      "rewards/<lambda>": 0.16242042183876038,
+      "step": 45
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.46938775510204084,
+      "grad_norm": 0.11778972297906876,
+      "kl": 0.0025551035068929195,
+      "learning_rate": 4.600000000000001e-05,
+      "loss": 0.0001,
+      "reward": 0.12898343801498413,
+      "reward_std": 0.09911265969276428,
+      "rewards/<lambda>": 0.12898343801498413,
+      "step": 46
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.47959183673469385,
+      "grad_norm": 0.13733811676502228,
+      "kl": 0.004000469576567411,
+      "learning_rate": 4.7e-05,
+      "loss": 0.0002,
+      "reward": 0.18428276479244232,
+      "reward_std": 0.1296135038137436,
+      "rewards/<lambda>": 0.18428276479244232,
+      "step": 47
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4897959183673469,
+      "grad_norm": 0.10343091934919357,
+      "kl": 0.0036480827257037163,
+      "learning_rate": 4.8e-05,
+      "loss": 0.0001,
+      "reward": 0.1390620619058609,
+      "reward_std": 0.10592266172170639,
+      "rewards/<lambda>": 0.1390620619058609,
+      "step": 48
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5,
+      "grad_norm": 0.1648188680410385,
+      "kl": 0.003458770690485835,
+      "learning_rate": 4.9e-05,
+      "loss": 0.0001,
+      "reward": 0.16380682587623596,
+      "reward_std": 0.10548572242259979,
+      "rewards/<lambda>": 0.16380682587623596,
+      "step": 49
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5102040816326531,
+      "grad_norm": 0.0953877791762352,
+      "kl": 0.002513691782951355,
+      "learning_rate": 5e-05,
+      "loss": 0.0001,
+      "reward": 0.14581327140331268,
+      "reward_std": 0.12193475663661957,
+      "rewards/<lambda>": 0.14581327140331268,
+      "step": 50
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5204081632653061,
+      "grad_norm": 0.14063876867294312,
+      "kl": 0.004633021075278521,
+      "learning_rate": 5.1000000000000006e-05,
+      "loss": 0.0002,
+      "reward": 0.20702508091926575,
+      "reward_std": 0.13710586726665497,
+      "rewards/<lambda>": 0.20702508091926575,
+      "step": 51
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5306122448979592,
+      "grad_norm": 0.1019497662782669,
+      "kl": 0.0037124312948435545,
+      "learning_rate": 5.2000000000000004e-05,
+      "loss": 0.0001,
+      "reward": 0.13214847445487976,
+      "reward_std": 0.09029597043991089,
+      "rewards/<lambda>": 0.13214847445487976,
+      "step": 52
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5408163265306123,
+      "grad_norm": 0.12852047383785248,
+      "kl": 0.004707551561295986,
+      "learning_rate": 5.300000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.1572979986667633,
+      "reward_std": 0.11500580608844757,
+      "rewards/<lambda>": 0.1572979986667633,
+      "step": 53
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5510204081632653,
+      "grad_norm": 0.1126970499753952,
+      "kl": 0.0029443646781146526,
+      "learning_rate": 5.4000000000000005e-05,
+      "loss": 0.0001,
+      "reward": 0.11422993242740631,
+      "reward_std": 0.1095927357673645,
+      "rewards/<lambda>": 0.11422993242740631,
+      "step": 54
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5612244897959183,
+      "grad_norm": 0.1265789270401001,
+      "kl": 0.004524104297161102,
+      "learning_rate": 5.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.1222875714302063,
+      "reward_std": 0.0923609733581543,
+      "rewards/<lambda>": 0.1222875714302063,
+      "step": 55
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5714285714285714,
+      "grad_norm": 0.12009646743535995,
+      "kl": 0.0038815774023532867,
+      "learning_rate": 5.6000000000000006e-05,
+      "loss": 0.0002,
+      "reward": 0.1649770438671112,
+      "reward_std": 0.10861348360776901,
+      "rewards/<lambda>": 0.1649770438671112,
+      "step": 56
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5816326530612245,
+      "grad_norm": 0.1011093482375145,
+      "kl": 0.0032921030651777983,
+      "learning_rate": 5.6999999999999996e-05,
+      "loss": 0.0001,
+      "reward": 0.17336896061897278,
+      "reward_std": 0.12328130006790161,
+      "rewards/<lambda>": 0.17336896061897278,
+      "step": 57
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5918367346938775,
+      "grad_norm": 0.10525462031364441,
+      "kl": 0.004167443141341209,
+      "learning_rate": 5.8e-05,
+      "loss": 0.0002,
+      "reward": 0.1390347182750702,
+      "reward_std": 0.11740852892398834,
+      "rewards/<lambda>": 0.1390347182750702,
+      "step": 58
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6020408163265306,
+      "grad_norm": 0.3547665774822235,
+      "kl": 0.00994147639721632,
+      "learning_rate": 5.9e-05,
+      "loss": 0.0004,
+      "reward": 0.1372447907924652,
+      "reward_std": 0.08885445445775986,
+      "rewards/<lambda>": 0.1372447907924652,
+      "step": 59
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6122448979591837,
+      "grad_norm": 0.14041198790073395,
+      "kl": 0.005412312224507332,
+      "learning_rate": 6e-05,
+      "loss": 0.0002,
+      "reward": 0.14578765630722046,
+      "reward_std": 0.095134437084198,
+      "rewards/<lambda>": 0.14578765630722046,
+      "step": 60
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6224489795918368,
+      "grad_norm": 0.13052710890769958,
+      "kl": 0.004689443856477737,
+      "learning_rate": 6.1e-05,
+      "loss": 0.0002,
+      "reward": 0.09727954864501953,
+      "reward_std": 0.06896870583295822,
+      "rewards/<lambda>": 0.09727954864501953,
+      "step": 61
+    },
+    {
+      "completion_length": 157.1875,
+      "epoch": 0.6326530612244898,
+      "grad_norm": 0.1129918321967125,
+      "kl": 0.004512472078204155,
+      "learning_rate": 6.2e-05,
+      "loss": 0.0002,
+      "reward": 0.17984223365783691,
+      "reward_std": 0.09912577271461487,
+      "rewards/<lambda>": 0.17984223365783691,
+      "step": 62
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6428571428571429,
+      "grad_norm": 0.11350245773792267,
+      "kl": 0.005884611513465643,
+      "learning_rate": 6.3e-05,
+      "loss": 0.0002,
+      "reward": 0.13487395644187927,
+      "reward_std": 0.09702014178037643,
+      "rewards/<lambda>": 0.13487395644187927,
+      "step": 63
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6530612244897959,
+      "grad_norm": 0.1478883922100067,
+      "kl": 0.0060376739129424095,
+      "learning_rate": 6.400000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.13012471795082092,
+      "reward_std": 0.09517987817525864,
+      "rewards/<lambda>": 0.13012471795082092,
+      "step": 64
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6632653061224489,
+      "grad_norm": 0.09250291436910629,
+      "kl": 0.005326719488948584,
+      "learning_rate": 6.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.17718777060508728,
+      "reward_std": 0.11379577964544296,
+      "rewards/<lambda>": 0.17718777060508728,
+      "step": 65
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.673469387755102,
+      "grad_norm": 0.18939143419265747,
+      "kl": 0.0049583157524466515,
+      "learning_rate": 6.6e-05,
+      "loss": 0.0002,
+      "reward": 0.14007148146629333,
+      "reward_std": 0.09873458743095398,
+      "rewards/<lambda>": 0.14007148146629333,
+      "step": 66
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6836734693877551,
+      "grad_norm": 0.09305741637945175,
+      "kl": 0.00586826354265213,
+      "learning_rate": 6.7e-05,
+      "loss": 0.0002,
+      "reward": 0.10462306439876556,
+      "reward_std": 0.09117605537176132,
+      "rewards/<lambda>": 0.10462306439876556,
+      "step": 67
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6938775510204082,
+      "grad_norm": 0.14123046398162842,
+      "kl": 0.006275478284806013,
+      "learning_rate": 6.800000000000001e-05,
+      "loss": 0.0003,
+      "reward": 0.16492299735546112,
+      "reward_std": 0.1278476119041443,
+      "rewards/<lambda>": 0.16492299735546112,
+      "step": 68
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7040816326530612,
+      "grad_norm": 0.10926412045955658,
+      "kl": 0.005929266568273306,
+      "learning_rate": 6.9e-05,
+      "loss": 0.0002,
+      "reward": 0.1173870787024498,
+      "reward_std": 0.09027393907308578,
+      "rewards/<lambda>": 0.1173870787024498,
+      "step": 69
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7142857142857143,
+      "grad_norm": 0.11727511882781982,
+      "kl": 0.007982933893799782,
+      "learning_rate": 7e-05,
+      "loss": 0.0003,
+      "reward": 0.1433873325586319,
+      "reward_std": 0.09991258382797241,
+      "rewards/<lambda>": 0.1433873325586319,
+      "step": 70
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7244897959183674,
+      "grad_norm": 0.09994583576917648,
+      "kl": 0.008151376619935036,
+      "learning_rate": 7.1e-05,
+      "loss": 0.0003,
+      "reward": 0.17650912702083588,
+      "reward_std": 0.1265939176082611,
+      "rewards/<lambda>": 0.17650912702083588,
+      "step": 71
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7346938775510204,
+      "grad_norm": 0.11233754456043243,
+      "kl": 0.008668525144457817,
+      "learning_rate": 7.2e-05,
+      "loss": 0.0003,
+      "reward": 0.1490788459777832,
+      "reward_std": 0.10998384654521942,
+      "rewards/<lambda>": 0.1490788459777832,
+      "step": 72
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7448979591836735,
+      "grad_norm": 0.15177902579307556,
+      "kl": 0.008991223759949207,
+      "learning_rate": 7.3e-05,
+      "loss": 0.0004,
+      "reward": 0.1158258318901062,
+      "reward_std": 0.09271200001239777,
+      "rewards/<lambda>": 0.1158258318901062,
+      "step": 73
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7551020408163265,
+      "grad_norm": 0.12361659109592438,
+      "kl": 0.009935474023222923,
+      "learning_rate": 7.4e-05,
+      "loss": 0.0004,
+      "reward": 0.16675977408885956,
+      "reward_std": 0.11429008096456528,
+      "rewards/<lambda>": 0.16675977408885956,
+      "step": 74
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7653061224489796,
+      "grad_norm": 0.10791321843862534,
+      "kl": 0.01029084250330925,
+      "learning_rate": 7.500000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.1560361236333847,
+      "reward_std": 0.12631307542324066,
+      "rewards/<lambda>": 0.1560361236333847,
+      "step": 75
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7755102040816326,
+      "grad_norm": 0.11729063093662262,
+      "kl": 0.013020636513829231,
+      "learning_rate": 7.6e-05,
+      "loss": 0.0005,
+      "reward": 0.13625627756118774,
+      "reward_std": 0.09627413749694824,
+      "rewards/<lambda>": 0.13625627756118774,
+      "step": 76
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7857142857142857,
+      "grad_norm": 0.14529183506965637,
+      "kl": 0.016384603455662727,
+      "learning_rate": 7.7e-05,
+      "loss": 0.0007,
+      "reward": 0.1917780488729477,
+      "reward_std": 0.10966574400663376,
+      "rewards/<lambda>": 0.1917780488729477,
+      "step": 77
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7959183673469388,
+      "grad_norm": 0.11583904176950455,
+      "kl": 0.016116080805659294,
+      "learning_rate": 7.800000000000001e-05,
+      "loss": 0.0006,
+      "reward": 0.1915496587753296,
+      "reward_std": 0.12190195918083191,
+      "rewards/<lambda>": 0.1915496587753296,
+      "step": 78
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8061224489795918,
+      "grad_norm": 0.10830377042293549,
+      "kl": 0.015441562049090862,
+      "learning_rate": 7.900000000000001e-05,
+      "loss": 0.0006,
+      "reward": 0.15522490441799164,
+      "reward_std": 0.08264317363500595,
+      "rewards/<lambda>": 0.15522490441799164,
+      "step": 79
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8163265306122449,
+      "grad_norm": 0.11148685216903687,
+      "kl": 0.01857638917863369,
+      "learning_rate": 8e-05,
+      "loss": 0.0007,
+      "reward": 0.21693138778209686,
+      "reward_std": 0.132161483168602,
+      "rewards/<lambda>": 0.21693138778209686,
+      "step": 80
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.826530612244898,
+      "grad_norm": 0.1557169258594513,
+      "kl": 0.01583457738161087,
+      "learning_rate": 8.1e-05,
+      "loss": 0.0006,
+      "reward": 0.1813575029373169,
+      "reward_std": 0.12332402914762497,
+      "rewards/<lambda>": 0.1813575029373169,
+      "step": 81
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8367346938775511,
+      "grad_norm": 0.11910691857337952,
+      "kl": 0.025130536407232285,
+      "learning_rate": 8.2e-05,
+      "loss": 0.001,
+      "reward": 0.1874808669090271,
+      "reward_std": 0.13184116780757904,
+      "rewards/<lambda>": 0.1874808669090271,
+      "step": 82
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8469387755102041,
+      "grad_norm": 0.09641101211309433,
+      "kl": 0.020103760063648224,
+      "learning_rate": 8.3e-05,
+      "loss": 0.0008,
+      "reward": 0.16875742375850677,
+      "reward_std": 0.12786784768104553,
+      "rewards/<lambda>": 0.16875742375850677,
+      "step": 83
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8571428571428571,
+      "grad_norm": 0.11505434662103653,
+      "kl": 0.03201855719089508,
+      "learning_rate": 8.4e-05,
+      "loss": 0.0013,
+      "reward": 0.1941194385290146,
+      "reward_std": 0.1352878361940384,
+      "rewards/<lambda>": 0.1941194385290146,
+      "step": 84
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8673469387755102,
+      "grad_norm": 0.20420552790164948,
+      "kl": 0.025220511481165886,
+      "learning_rate": 8.5e-05,
+      "loss": 0.001,
+      "reward": 0.1499064564704895,
+      "reward_std": 0.10931651294231415,
+      "rewards/<lambda>": 0.1499064564704895,
+      "step": 85
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8775510204081632,
+      "grad_norm": 0.1504613608121872,
+      "kl": 0.030149614438414574,
+      "learning_rate": 8.6e-05,
+      "loss": 0.0012,
+      "reward": 0.2021590918302536,
+      "reward_std": 0.11808289587497711,
+      "rewards/<lambda>": 0.2021590918302536,
+      "step": 86
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8877551020408163,
+      "grad_norm": 0.12947432696819305,
+      "kl": 0.03943827003240585,
+      "learning_rate": 8.7e-05,
+      "loss": 0.0016,
+      "reward": 0.16302035748958588,
+      "reward_std": 0.12248598039150238,
+      "rewards/<lambda>": 0.16302035748958588,
+      "step": 87
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8979591836734694,
+      "grad_norm": 0.10003241151571274,
+      "kl": 0.031025437638163567,
+      "learning_rate": 8.800000000000001e-05,
+      "loss": 0.0012,
+      "reward": 0.17104387283325195,
+      "reward_std": 0.11305706948041916,
+      "rewards/<lambda>": 0.17104387283325195,
+      "step": 88
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9081632653061225,
+      "grad_norm": 0.178684264421463,
+      "kl": 0.0443422794342041,
+      "learning_rate": 8.900000000000001e-05,
+      "loss": 0.0018,
+      "reward": 0.21491599082946777,
+      "reward_std": 0.11920367181301117,
+      "rewards/<lambda>": 0.21491599082946777,
+      "step": 89
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9183673469387755,
+      "grad_norm": 0.4619559347629547,
+      "kl": 0.14378872513771057,
+      "learning_rate": 9e-05,
+      "loss": 0.0058,
+      "reward": 0.170355886220932,
+      "reward_std": 0.11420422047376633,
+      "rewards/<lambda>": 0.170355886220932,
+      "step": 90
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9285714285714286,
+      "grad_norm": 0.18158575892448425,
+      "kl": 0.05551178380846977,
+      "learning_rate": 9.1e-05,
+      "loss": 0.0022,
+      "reward": 0.17519733309745789,
+      "reward_std": 0.1182706207036972,
+      "rewards/<lambda>": 0.17519733309745789,
+      "step": 91
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9387755102040817,
+      "grad_norm": 0.1649360954761505,
+      "kl": 0.0335373729467392,
+      "learning_rate": 9.200000000000001e-05,
+      "loss": 0.0013,
+      "reward": 0.19423919916152954,
+      "reward_std": 0.13439567387104034,
+      "rewards/<lambda>": 0.19423919916152954,
+      "step": 92
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9489795918367347,
+      "grad_norm": 0.10932810604572296,
+      "kl": 0.035153284668922424,
+      "learning_rate": 9.300000000000001e-05,
+      "loss": 0.0014,
+      "reward": 0.15611481666564941,
+      "reward_std": 0.09644447267055511,
+      "rewards/<lambda>": 0.15611481666564941,
+      "step": 93
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9591836734693877,
+      "grad_norm": 0.099422886967659,
+      "kl": 0.03238137811422348,
+      "learning_rate": 9.4e-05,
+      "loss": 0.0013,
+      "reward": 0.1927059292793274,
+      "reward_std": 0.11067558825016022,
+      "rewards/<lambda>": 0.1927059292793274,
+      "step": 94
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9693877551020408,
+      "grad_norm": 0.1119072213768959,
+      "kl": 0.027582138776779175,
+      "learning_rate": 9.5e-05,
+      "loss": 0.0011,
+      "reward": 0.2055554836988449,
+      "reward_std": 0.14050546288490295,
+      "rewards/<lambda>": 0.2055554836988449,
+      "step": 95
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9795918367346939,
+      "grad_norm": 0.119843028485775,
+      "kl": 0.029401419684290886,
+      "learning_rate": 9.6e-05,
+      "loss": 0.0012,
+      "reward": 0.19861146807670593,
+      "reward_std": 0.12701785564422607,
+      "rewards/<lambda>": 0.19861146807670593,
+      "step": 96
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9897959183673469,
+      "grad_norm": 0.10992871224880219,
+      "kl": 0.03312207758426666,
+      "learning_rate": 9.7e-05,
+      "loss": 0.0013,
+      "reward": 0.15577931702136993,
+      "reward_std": 0.10662981122732162,
+      "rewards/<lambda>": 0.15577931702136993,
+      "step": 97
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0,
+      "grad_norm": 0.13580623269081116,
+      "kl": 0.0318591445684433,
+      "learning_rate": 9.8e-05,
+      "loss": 0.0013,
+      "reward": 0.256815642118454,
+      "reward_std": 0.17193631827831268,
+      "rewards/<lambda>": 0.256815642118454,
+      "step": 98
+    }
+  ],
+  "logging_steps": 1,
+  "max_steps": 98,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 1,
+  "save_steps": 100,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": true
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 0.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}

checkpoint-98/training_args.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3c8aa9ac77fddcdbab49223b9355a1fad052dcac253468ef871ea988060bcd52
+size 5560

checkpoint-98/vocab.json ADDED

The diff for this file is too large to render.