camilablank/pirate-data-steer

camilablank committed on Apr 21

Commit 1bc2c2e · verified · 1 Parent(s): 8a3b309

Upload pirate_L16_a150 seed_42 (final adapter + all intermediate checkpoints)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +11 -0
README.md +62 -0
adapter_config.json +48 -0
adapter_model.safetensors +3 -0
chat_template.jinja +54 -0
checkpoint-1086/README.md +209 -0
checkpoint-1086/adapter_config.json +48 -0
checkpoint-1086/adapter_model.safetensors +3 -0
checkpoint-1086/chat_template.jinja +54 -0
checkpoint-1086/tokenizer.json +3 -0
checkpoint-1086/tokenizer_config.json +29 -0
checkpoint-1086/trainer_state.json +1136 -0
checkpoint-1086/training_args.bin +3 -0
checkpoint-1629/README.md +209 -0
checkpoint-1629/adapter_config.json +48 -0
checkpoint-1629/adapter_model.safetensors +3 -0
checkpoint-1629/chat_template.jinja +54 -0
checkpoint-1629/tokenizer.json +3 -0
checkpoint-1629/tokenizer_config.json +29 -0
checkpoint-1629/trainer_state.json +1687 -0
checkpoint-1629/training_args.bin +3 -0
checkpoint-2172/README.md +209 -0
checkpoint-2172/adapter_config.json +48 -0
checkpoint-2172/adapter_model.safetensors +3 -0
checkpoint-2172/chat_template.jinja +54 -0
checkpoint-2172/tokenizer.json +3 -0
checkpoint-2172/tokenizer_config.json +29 -0
checkpoint-2172/trainer_state.json +2248 -0
checkpoint-2172/training_args.bin +3 -0
checkpoint-2715/README.md +209 -0
checkpoint-2715/adapter_config.json +48 -0
checkpoint-2715/adapter_model.safetensors +3 -0
checkpoint-2715/chat_template.jinja +54 -0
checkpoint-2715/tokenizer.json +3 -0
checkpoint-2715/tokenizer_config.json +29 -0
checkpoint-2715/trainer_state.json +2799 -0
checkpoint-2715/training_args.bin +3 -0
checkpoint-3258/README.md +209 -0
checkpoint-3258/adapter_config.json +48 -0
checkpoint-3258/adapter_model.safetensors +3 -0
checkpoint-3258/chat_template.jinja +54 -0
checkpoint-3258/tokenizer.json +3 -0
checkpoint-3258/tokenizer_config.json +29 -0
checkpoint-3258/trainer_state.json +0 -0
checkpoint-3258/training_args.bin +3 -0
checkpoint-3801/README.md +209 -0
checkpoint-3801/adapter_config.json +48 -0
checkpoint-3801/adapter_model.safetensors +3 -0
checkpoint-3801/chat_template.jinja +54 -0
checkpoint-3801/tokenizer.json +3 -0

.gitattributes CHANGED

@@ -33,3 +33,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+checkpoint-1086/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-1629/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-2172/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-2715/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-3258/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-3801/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-4344/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-4887/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-543/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+checkpoint-5430/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
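
The added rules route each checkpoint's tokenizer.json through Git LFS. The sketch below shows one way to check which paths such rules cover. It assumes Python's `fnmatch` is an acceptable stand-in for gitattributes pattern matching, which holds for these literal paths and simple `*` globs but not for every gitattributes pattern (for example `**`). The helper names `lfs_patterns` and `lfs_tracked` are hypothetical.

```python
from fnmatch import fnmatch

# A few rules copied from the .gitattributes hunk above.
GITATTRIBUTES = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
checkpoint-1086/tokenizer.json filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Return the path patterns of all rules that set filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def lfs_tracked(path, patterns):
    """Approximate check (assumption: fnmatch semantics).

    A pattern without a slash is also tried against the basename,
    mirroring how such patterns apply at any directory level.
    """
    name = path.rsplit("/", 1)[-1]
    return any(
        fnmatch(path, p) or ("/" not in p and fnmatch(name, p))
        for p in patterns
    )


patterns = lfs_patterns(GITATTRIBUTES)
print(lfs_tracked("checkpoint-1086/tokenizer.json", patterns))
print(lfs_tracked("adapter_config.json", patterns))
```

Under these assumptions the bare `tokenizer.json` rule already covers the per-checkpoint copies, so the explicit per-checkpoint rules act as a belt-and-braces listing rather than a requirement.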

README.md ADDED

	@@ -0,0 +1,62 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+model_name: seed_42
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+licence: license
+pipeline_tag: text-generation
+---
+# Model Card for seed_42
+This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).
+It has been trained using [TRL](https://github.com/huggingface/trl).
+## Quick start
+```python
+from transformers import pipeline
+question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+generator = pipeline("text-generation", model="None", device="cuda")
+output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+print(output["generated_text"])
+```
+## Training procedure
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/camilab-stanford-university/subliminal_learning/runs/9746rvh5)
+This model was trained with SFT.
+### Framework versions
+- PEFT 0.19.1
+- TRL: 1.2.0
+- Transformers: 5.5.4
+- Pytorch: 2.10.0
+- Datasets: 4.8.4
+- Tokenizers: 0.22.2
+## Citations
+Cite TRL as:
+```bibtex
+@software{vonwerra2020trl,
+  title   = {{TRL: Transformers Reinforcement Learning}},
+  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
+  license = {Apache-2.0},
+  url     = {https://github.com/huggingface/trl},
+  year    = {2020}
+}
+```
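
The quick start as committed passes `model="None"`, which does not name a repository. A hedged sketch of loading the adapter on top of its base model with PEFT follows. It assumes the adapter lives at `camilablank/pirate-data-steer`, that `peft` and `transformers` are installed, and that `PeftModel.from_pretrained` forwards a `subfolder` keyword to the Hub download. The helper names `checkpoint_subfolder`, `load_adapter`, and `generate` are hypothetical.

```python
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"          # base_model from the front matter above
ADAPTER_REPO = "camilablank/pirate-data-steer"   # assumption: the repo this commit belongs to


def checkpoint_subfolder(step=None):
    """Map a training step to its folder; None selects the final adapter at the repo root."""
    return None if step is None else f"checkpoint-{step}"


def load_adapter(step=None):
    """Load the base model and attach the LoRA adapter (downloads weights on first use)."""
    # Heavy imports stay local so the helper above works without them.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    # dtype and device placement are left at library defaults in this sketch.
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    kwargs = {}
    subfolder = checkpoint_subfolder(step)
    if subfolder is not None:
        kwargs["subfolder"] = subfolder  # assumption: forwarded to the Hub download
    model = PeftModel.from_pretrained(base, ADAPTER_REPO, **kwargs)
    return tokenizer, model


def generate(tokenizer, model, question, max_new_tokens=128):
    """Apply the chat template, generate, and return only the newly generated text."""
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `load_adapter()` would use the final adapter, while `load_adapter(1086)` would target the `checkpoint-1086` folder listed in this commit.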

adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
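
As a worked reading of this config: with `use_rslora` false, LoRA scales its update by `lora_alpha / r`, and each adapted linear layer adds `r * (in_features + out_features)` parameters. The sketch below computes both. The layer dimensions are assumptions about Qwen2.5-7B-Instruct (hidden size 3584, MLP size 18944, key/value projection width 512, 28 layers) and are not stated anywhere in this commit.

```python
# Values read from adapter_config.json above.
R = 8
LORA_ALPHA = 32
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Assumed Qwen2.5-7B-Instruct dimensions (not part of this commit).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512        # assumed: 4 key/value heads x 128 head dim
NUM_LAYERS = 28

# (in_features, out_features) for each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes, modules, num_layers):
    """Sum r * (in + out) over the targeted modules, repeated for every layer."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in modules)
    return per_layer * num_layers


scaling = LORA_ALPHA / R
total = lora_param_count(R, SHAPES, TARGET_MODULES, NUM_LAYERS)
print(scaling)  # 4.0
print(total)    # 20185088
```

At 4 bytes per parameter this comes to 80,740,352 bytes, which would be consistent with the 80,792,096-byte `adapter_model.safetensors` pointer in this commit if the weights are stored in 32-bit floats plus a small header; that storage format is an inference, not something the commit states.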

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44a37f4e701d0d74b2f3087cbbb4d2cce354e5bfcb1651dbb4d48e82fd2234d7
+size 80792096
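
What the diff shows here is a Git LFS pointer, not the weights themselves: three `key value` lines giving the spec version, a SHA-256 object id, and the byte size. A minimal sketch of parsing such a pointer and checking a downloaded file against it is below; the function names are hypothetical and the local file path is whatever the reader has downloaded.

```python
import hashlib
import os

# Pointer text copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:44a37f4e701d0d74b2f3087cbbb4d2cce354e5bfcb1651dbb4d48e82fd2234d7
size 80792096
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, info, chunk_size=1 << 20):
    """Return True if the local file has the size and SHA-256 digest the pointer records."""
    if os.path.getsize(path) != info["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == info["digest"]


info = parse_lfs_pointer(POINTER)
print(info["algo"], info["size"])
```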

chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
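
To make the template's output concrete, here is a simplified Python re-implementation of its no-tools branch only. It is a sketch, not the template itself: it handles `system`, `user`, and plain `assistant` messages and deliberately rejects tool calls and tool responses. The function name `render_chatml` is hypothetical.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Mimic the no-tools path of the Jinja template above for simple messages."""
    for m in messages:
        if m["role"] not in ("system", "user", "assistant") or m.get("tool_calls"):
            raise ValueError("this sketch does not cover tool calls or tool responses")

    # The template always opens with a system turn, falling back to a default prompt.
    if messages[0]["role"] == "system":
        system = messages[0]["content"]
    else:
        system = DEFAULT_SYSTEM
    out = "<|im_start|>system\n" + system + "<|im_end|>\n"

    for i, m in enumerate(messages):
        if m["role"] == "system" and i == 0:
            continue  # already emitted as the opening system turn
        out += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>\n"

    # The generation prompt leaves an open assistant turn for the model to fill.
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out


print(render_chatml([{"role": "user", "content": "Who are you?"}], add_generation_prompt=True))
```

For real use, the tokenizer's own `apply_chat_template` remains the reference; the sketch only illustrates the ChatML framing the template produces.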

checkpoint-1086/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-1086/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}

checkpoint-1086/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eace20f3ff41af75c2ea9c9643e6063d8734d60b4b3bcec95f05c8940d3430be
+size 80792096

checkpoint-1086/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}

checkpoint-1086/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

checkpoint-1086/tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

checkpoint-1086/trainer_state.json ADDED

	@@ -0,0 +1,1136 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 2.0,
+  "eval_steps": 500,
+  "global_step": 1086,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "entropy": 1.2237394809722901,
+      "epoch": 0.01841620626151013,
+      "grad_norm": 5.082435607910156,
+      "learning_rate": 3.308823529411765e-06,
+      "loss": 0.9237876892089844,
+      "mean_token_accuracy": 0.7685343027114868,
+      "num_tokens": 205423.0,
+      "step": 10
+    },
+    {
+      "entropy": 1.2295925617218018,
+      "epoch": 0.03683241252302026,
+      "grad_norm": 4.672000408172607,
+      "learning_rate": 6.985294117647059e-06,
+      "loss": 0.8900892257690429,
+      "mean_token_accuracy": 0.7677771031856537,
+      "num_tokens": 410849.0,
+      "step": 20
+    },
+    {
+      "entropy": 1.2285718679428101,
+      "epoch": 0.055248618784530384,
+      "grad_norm": 1.4828118085861206,
+      "learning_rate": 1.0661764705882354e-05,
+      "loss": 0.5975452899932862,
+      "mean_token_accuracy": 0.8146551787853241,
+      "num_tokens": 616438.0,
+      "step": 30
+    },
+    {
+      "entropy": 1.210776400566101,
+      "epoch": 0.07366482504604052,
+      "grad_norm": 0.7761328816413879,
+      "learning_rate": 1.4338235294117647e-05,
+      "loss": 0.40664992332458494,
+      "mean_token_accuracy": 0.8699092030525207,
+      "num_tokens": 822118.0,
+      "step": 40
+    },
+    {
+      "entropy": 1.200321125984192,
+      "epoch": 0.09208103130755065,
+      "grad_norm": 0.5363371968269348,
+      "learning_rate": 1.8014705882352943e-05,
+      "loss": 0.3313469409942627,
+      "mean_token_accuracy": 0.8904915869235992,
+      "num_tokens": 1027941.0,
+      "step": 50
+    },
+    {
+      "entropy": 1.1809936046600342,
+      "epoch": 0.11049723756906077,
+      "grad_norm": 0.39541518688201904,
+      "learning_rate": 2.1691176470588237e-05,
+      "loss": 0.27568228244781495,
+      "mean_token_accuracy": 0.9047131836414337,
+      "num_tokens": 1233620.0,
+      "step": 60
+    },
+    {
+      "entropy": 1.169810914993286,
+      "epoch": 0.1289134438305709,
+      "grad_norm": 0.341960072517395,
+      "learning_rate": 2.536764705882353e-05,
+      "loss": 0.245219087600708,
+      "mean_token_accuracy": 0.9150686681270599,
+      "num_tokens": 1438656.0,
+      "step": 70
+    },
+    {
+      "entropy": 1.1652960777282715,
+      "epoch": 0.14732965009208104,
+      "grad_norm": 0.36872178316116333,
+      "learning_rate": 2.9044117647058828e-05,
+      "loss": 0.2220149040222168,
+      "mean_token_accuracy": 0.9224777698516846,
+      "num_tokens": 1643877.0,
+      "step": 80
+    },
+    {
+      "entropy": 1.154341197013855,
+      "epoch": 0.16574585635359115,
+      "grad_norm": 0.4152425229549408,
+      "learning_rate": 3.272058823529412e-05,
+      "loss": 0.2002798557281494,
+      "mean_token_accuracy": 0.9285802960395813,
+      "num_tokens": 1849506.0,
+      "step": 90
+    },
+    {
+      "entropy": 1.1507258892059327,
+      "epoch": 0.1841620626151013,
+      "grad_norm": 0.47647765278816223,
+      "learning_rate": 3.639705882352941e-05,
+      "loss": 0.18871363401412963,
+      "mean_token_accuracy": 0.9318056285381318,
+      "num_tokens": 2055071.0,
+      "step": 100
+    },
+    {
+      "entropy": 1.1455535531044005,
+      "epoch": 0.20257826887661143,
+      "grad_norm": 0.4853009581565857,
+      "learning_rate": 4.007352941176471e-05,
+      "loss": 0.17836341857910157,
+      "mean_token_accuracy": 0.9367631554603577,
+      "num_tokens": 2260643.0,
+      "step": 110
+    },
+    {
+      "entropy": 1.1402526497840881,
+      "epoch": 0.22099447513812154,
+      "grad_norm": 0.4455392360687256,
+      "learning_rate": 4.375e-05,
+      "loss": 0.16921783685684205,
+      "mean_token_accuracy": 0.9386959195137023,
+      "num_tokens": 2466085.0,
+      "step": 120
+    },
+    {
+      "entropy": 1.1374777555465698,
+      "epoch": 0.23941068139963168,
+      "grad_norm": 0.5880279541015625,
+      "learning_rate": 4.742647058823529e-05,
+      "loss": 0.15989291667938232,
+      "mean_token_accuracy": 0.9421182632446289,
+      "num_tokens": 2671024.0,
+      "step": 130
+    },
+    {
+      "entropy": 1.1273940205574036,
+      "epoch": 0.2578268876611418,
+      "grad_norm": 0.612959086894989,
+      "learning_rate": 5.110294117647059e-05,
+      "loss": 0.14701461791992188,
+      "mean_token_accuracy": 0.9463540315628052,
+      "num_tokens": 2876848.0,
+      "step": 140
+    },
+    {
+      "entropy": 1.1263513088226318,
+      "epoch": 0.27624309392265195,
+      "grad_norm": 0.5695255398750305,
+      "learning_rate": 5.477941176470589e-05,
+      "loss": 0.14604382514953612,
+      "mean_token_accuracy": 0.946351945400238,
+      "num_tokens": 3082589.0,
+      "step": 150
+    },
+    {
+      "entropy": 1.1290789365768432,
+      "epoch": 0.2946593001841621,
+      "grad_norm": 0.6608090996742249,
+      "learning_rate": 5.845588235294118e-05,
+      "loss": 0.1409450054168701,
+      "mean_token_accuracy": 0.9481450319290161,
+      "num_tokens": 3287459.0,
+      "step": 160
+    },
+    {
+      "entropy": 1.1291529774665832,
+      "epoch": 0.31307550644567217,
+      "grad_norm": 0.652715802192688,
+      "learning_rate": 6.213235294117647e-05,
+      "loss": 0.14441155195236205,
+      "mean_token_accuracy": 0.9466125547885895,
+      "num_tokens": 3493682.0,
+      "step": 170
+    },
+    {
+      "entropy": 1.1244838953018188,
+      "epoch": 0.3314917127071823,
+      "grad_norm": 0.7815241813659668,
+      "learning_rate": 6.580882352941177e-05,
+      "loss": 0.13361064195632935,
+      "mean_token_accuracy": 0.9512295544147491,
+      "num_tokens": 3699573.0,
+      "step": 180
+    },
+    {
+      "entropy": 1.1217721104621887,
+      "epoch": 0.34990791896869244,
+      "grad_norm": 0.7933160066604614,
+      "learning_rate": 6.948529411764706e-05,
+      "loss": 0.13089522123336791,
+      "mean_token_accuracy": 0.9520221531391144,
+      "num_tokens": 3905156.0,
+      "step": 190
+    },
+    {
+      "entropy": 1.1206679105758668,
+      "epoch": 0.3683241252302026,
+      "grad_norm": 0.6815240383148193,
+      "learning_rate": 7.316176470588236e-05,
+      "loss": 0.13400404453277587,
+      "mean_token_accuracy": 0.9501322209835052,
+      "num_tokens": 4110570.0,
+      "step": 200
+    },
+    {
+      "entropy": 1.1161052227020263,
+      "epoch": 0.3867403314917127,
+      "grad_norm": 0.8297767639160156,
+      "learning_rate": 7.683823529411766e-05,
+      "loss": 0.13389937877655028,
+      "mean_token_accuracy": 0.9501932203769684,
+      "num_tokens": 4315834.0,
+      "step": 210
+    },
+    {
+      "entropy": 1.1098745942115784,
+      "epoch": 0.40515653775322286,
+      "grad_norm": 0.5943381786346436,
+      "learning_rate": 8.051470588235294e-05,
+      "loss": 0.13452907800674438,
+      "mean_token_accuracy": 0.9503286242485046,
+      "num_tokens": 4520807.0,
+      "step": 220
+    },
+    {
+      "entropy": 1.100480353832245,
+      "epoch": 0.42357274401473294,
+      "grad_norm": 0.6094359755516052,
+      "learning_rate": 8.419117647058824e-05,
+      "loss": 0.12827746868133544,
+      "mean_token_accuracy": 0.952492094039917,
+      "num_tokens": 4725867.0,
+      "step": 230
+    },
+    {
+      "entropy": 1.0901286959648133,
+      "epoch": 0.4419889502762431,
+      "grad_norm": 0.7240597605705261,
+      "learning_rate": 8.786764705882353e-05,
+      "loss": 0.12171242237091065,
+      "mean_token_accuracy": 0.953943532705307,
+      "num_tokens": 4931629.0,
+      "step": 240
+    },
+    {
+      "entropy": 1.0885071873664856,
+      "epoch": 0.4604051565377532,
+      "grad_norm": 0.6939547657966614,
+      "learning_rate": 9.154411764705882e-05,
+      "loss": 0.12155698537826538,
+      "mean_token_accuracy": 0.9545870959758759,
+      "num_tokens": 5137285.0,
+      "step": 250
+    },
+    {
+      "entropy": 1.086272156238556,
+      "epoch": 0.47882136279926335,
+      "grad_norm": 0.5752800703048706,
+      "learning_rate": 9.522058823529412e-05,
+      "loss": 0.12157790660858155,
+      "mean_token_accuracy": 0.9541126549243927,
+      "num_tokens": 5342575.0,
+      "step": 260
+    },
+    {
+      "entropy": 1.0857678413391114,
+      "epoch": 0.4972375690607735,
+      "grad_norm": 0.7565123438835144,
+      "learning_rate": 9.889705882352942e-05,
+      "loss": 0.12349612712860107,
+      "mean_token_accuracy": 0.9535140514373779,
+      "num_tokens": 5547995.0,
+      "step": 270
+    },
+    {
+      "entropy": 1.079762625694275,
+      "epoch": 0.5156537753222836,
+      "grad_norm": 0.6972768306732178,
+      "learning_rate": 9.999954556423843e-05,
+      "loss": 0.11875582933425903,
+      "mean_token_accuracy": 0.9556483089923858,
+      "num_tokens": 5753195.0,
+      "step": 280
+    },
+    {
+      "entropy": 1.0742079138755798,
+      "epoch": 0.5340699815837937,
+      "grad_norm": 0.7821696996688843,
+      "learning_rate": 9.999731977631227e-05,
+      "loss": 0.11824090480804443,
+      "mean_token_accuracy": 0.9557521045207977,
+      "num_tokens": 5958236.0,
+      "step": 290
+    },
+    {
+      "entropy": 1.0679773569107056,
+      "epoch": 0.5524861878453039,
+      "grad_norm": 0.5846888422966003,
+      "learning_rate": 9.999323925089486e-05,
+      "loss": 0.11707355976104736,
+      "mean_token_accuracy": 0.9554719448089599,
+      "num_tokens": 6163992.0,
+      "step": 300
+    },
+    {
+      "entropy": 1.0655727863311768,
+      "epoch": 0.570902394106814,
+      "grad_norm": 0.5812502503395081,
+      "learning_rate": 9.998730413936037e-05,
+      "loss": 0.11371417045593261,
+      "mean_token_accuracy": 0.9576376020908356,
+      "num_tokens": 6369456.0,
+      "step": 310
+    },
+    {
+      "entropy": 1.0607039332389832,
+      "epoch": 0.5893186003683242,
+      "grad_norm": 0.6238475441932678,
+      "learning_rate": 9.99795146618821e-05,
+      "loss": 0.11775733232498169,
+      "mean_token_accuracy": 0.9557221591472626,
+      "num_tokens": 6574833.0,
+      "step": 320
+    },
+    {
+      "entropy": 1.0504255175590516,
+      "epoch": 0.6077348066298343,
+      "grad_norm": 0.6496815085411072,
+      "learning_rate": 9.996987110742422e-05,
+      "loss": 0.10904088020324706,
+      "mean_token_accuracy": 0.9585366368293762,
+      "num_tokens": 6780108.0,
+      "step": 330
+    },
+    {
+      "entropy": 1.0456081986427308,
+      "epoch": 0.6261510128913443,
+      "grad_norm": 0.786702573299408,
+      "learning_rate": 9.995837383373119e-05,
+      "loss": 0.10642309188842773,
+      "mean_token_accuracy": 0.9596696078777314,
+      "num_tokens": 6985920.0,
+      "step": 340
+    },
+    {
+      "entropy": 1.0455098271369934,
+      "epoch": 0.6445672191528545,
+      "grad_norm": 0.5473790168762207,
+      "learning_rate": 9.994502326731434e-05,
+      "loss": 0.10822961330413819,
+      "mean_token_accuracy": 0.959563136100769,
+      "num_tokens": 7191465.0,
+      "step": 350
+    },
+    {
+      "entropy": 1.04240562915802,
+      "epoch": 0.6629834254143646,
+      "grad_norm": 0.6672356128692627,
+      "learning_rate": 9.992981990343614e-05,
+      "loss": 0.1110004186630249,
+      "mean_token_accuracy": 0.9582514643669129,
+      "num_tokens": 7396877.0,
+      "step": 360
+    },
+    {
+      "entropy": 1.0386811256408692,
+      "epoch": 0.6813996316758748,
+      "grad_norm": 0.698539674282074,
+      "learning_rate": 9.99127643060918e-05,
+      "loss": 0.107539963722229,
+      "mean_token_accuracy": 0.9593036234378814,
+      "num_tokens": 7602437.0,
+      "step": 370
+    },
+    {
+      "entropy": 1.0311225533485413,
+      "epoch": 0.6998158379373849,
+      "grad_norm": 0.6629284024238586,
+      "learning_rate": 9.989385710798837e-05,
+      "loss": 0.1064023494720459,
+      "mean_token_accuracy": 0.9602205216884613,
+      "num_tokens": 7808142.0,
+      "step": 380
+    },
+    {
+      "entropy": 1.030210506916046,
+      "epoch": 0.7182320441988951,
+      "grad_norm": 0.5616748929023743,
+      "learning_rate": 9.987309901052121e-05,
+      "loss": 0.10717041492462158,
+      "mean_token_accuracy": 0.9599347949028015,
+      "num_tokens": 8013407.0,
+      "step": 390
+    },
+    {
+      "entropy": 1.0208017826080322,
+      "epoch": 0.7366482504604052,
+      "grad_norm": 0.6329049468040466,
+      "learning_rate": 9.985049078374806e-05,
+      "loss": 0.10359601974487305,
+      "mean_token_accuracy": 0.9603756129741668,
+      "num_tokens": 8219040.0,
+      "step": 400
+    },
+    {
+      "entropy": 1.015640377998352,
+      "epoch": 0.7550644567219152,
+      "grad_norm": 0.6516013741493225,
+      "learning_rate": 9.982603326636037e-05,
+      "loss": 0.10146439075469971,
+      "mean_token_accuracy": 0.9627702474594116,
+      "num_tokens": 8424678.0,
+      "step": 410
+    },
+    {
+      "entropy": 1.0105359435081482,
+      "epoch": 0.7734806629834254,
+      "grad_norm": 0.6920603513717651,
+      "learning_rate": 9.979972736565226e-05,
+      "loss": 0.10770498514175415,
+      "mean_token_accuracy": 0.9591470420360565,
+      "num_tokens": 8629868.0,
+      "step": 420
+    },
+    {
+      "entropy": 0.9966452836990356,
+      "epoch": 0.7918968692449355,
+      "grad_norm": 0.6857476234436035,
+      "learning_rate": 9.977157405748687e-05,
+      "loss": 0.10282524824142455,
+      "mean_token_accuracy": 0.9612209022045135,
+      "num_tokens": 8835320.0,
+      "step": 430
+    },
+    {
+      "entropy": 0.9945534646511078,
+      "epoch": 0.8103130755064457,
+      "grad_norm": 0.7208472490310669,
+      "learning_rate": 9.974157438626008e-05,
+      "loss": 0.10069938898086547,
+      "mean_token_accuracy": 0.9620070576667785,
+      "num_tokens": 9041123.0,
+      "step": 440
+    },
+    {
+      "entropy": 0.979461395740509,
+      "epoch": 0.8287292817679558,
+      "grad_norm": 0.5071915984153748,
+      "learning_rate": 9.970972946486185e-05,
+      "loss": 0.09799174070358277,
+      "mean_token_accuracy": 0.9620374023914338,
+      "num_tokens": 9246361.0,
+      "step": 450
+    },
+    {
+      "entropy": 0.9830998003482818,
+      "epoch": 0.8471454880294659,
+      "grad_norm": 0.8660802245140076,
+      "learning_rate": 9.967604047463493e-05,
+      "loss": 0.10378165245056152,
+      "mean_token_accuracy": 0.9606865763664245,
+      "num_tokens": 9451845.0,
+      "step": 460
+    },
+    {
+      "entropy": 0.9813413023948669,
+      "epoch": 0.8655616942909761,
+      "grad_norm": 0.7642477750778198,
+      "learning_rate": 9.964050866533094e-05,
+      "loss": 0.1010061264038086,
+      "mean_token_accuracy": 0.9608745336532593,
+      "num_tokens": 9656802.0,
+      "step": 470
+    },
+    {
+      "entropy": 0.967874163389206,
+      "epoch": 0.8839779005524862,
+      "grad_norm": 0.5987281799316406,
+      "learning_rate": 9.960313535506411e-05,
+      "loss": 0.10169394016265869,
+      "mean_token_accuracy": 0.9611998200416565,
+      "num_tokens": 9861719.0,
+      "step": 480
+    },
+    {
+      "entropy": 0.9663491308689117,
+      "epoch": 0.9023941068139963,
+      "grad_norm": 0.6124638319015503,
+      "learning_rate": 9.956392193026239e-05,
+      "loss": 0.102389657497406,
+      "mean_token_accuracy": 0.9611884355545044,
+      "num_tokens": 10066673.0,
+      "step": 490
+    },
+    {
+      "entropy": 0.959654438495636,
+      "epoch": 0.9208103130755064,
+      "grad_norm": 0.7873051762580872,
+      "learning_rate": 9.952286984561592e-05,
+      "loss": 0.10170392990112305,
+      "mean_token_accuracy": 0.9610928475856781,
+      "num_tokens": 10272091.0,
+      "step": 500
+    },
+    {
+      "entropy": 0.9550537407398224,
+      "epoch": 0.9392265193370166,
+      "grad_norm": 0.6071968078613281,
+      "learning_rate": 9.947998062402313e-05,
+      "loss": 0.09448277950286865,
+      "mean_token_accuracy": 0.9648977637290954,
+      "num_tokens": 10477632.0,
+      "step": 510
+    },
+    {
+      "entropy": 0.9538533687591553,
+      "epoch": 0.9576427255985267,
+      "grad_norm": 0.6317242980003357,
+      "learning_rate": 9.943525585653428e-05,
+      "loss": 0.09542192220687866,
+      "mean_token_accuracy": 0.9635261118412017,
+      "num_tokens": 10682828.0,
+      "step": 520
+    },
+    {
+      "entropy": 0.9362513542175293,
+      "epoch": 0.9760589318600368,
+      "grad_norm": 0.6421944499015808,
+      "learning_rate": 9.938869720229234e-05,
+      "loss": 0.09382058382034301,
+      "mean_token_accuracy": 0.9648073971271515,
+      "num_tokens": 10888741.0,
+      "step": 530
+    },
+    {
+      "entropy": 0.9235438346862793,
+      "epoch": 0.994475138121547,
+      "grad_norm": 0.7986873388290405,
+      "learning_rate": 9.934030638847155e-05,
+      "loss": 0.09827429056167603,
+      "mean_token_accuracy": 0.9621128737926483,
+      "num_tokens": 11094387.0,
+      "step": 540
+    },
+    {
+      "epoch": 1.0,
+      "eval_entropy": 0.9137652366057686,
+      "eval_loss": 0.09368764609098434,
+      "eval_mean_token_accuracy": 0.9640816880309063,
+      "eval_num_tokens": 11155908.0,
+      "eval_runtime": 10.4701,
+      "eval_samples_per_second": 349.377,
+      "eval_steps_per_second": 10.984,
+      "step": 543
+    },
+    {
+      "entropy": 0.9047818422317505,
+      "epoch": 1.0128913443830572,
+      "grad_norm": 0.6781501173973083,
+      "learning_rate": 9.929008521021325e-05,
+      "loss": 0.0863916516304016,
+      "mean_token_accuracy": 0.9673655688762665,
+      "num_tokens": 11299715.0,
+      "step": 550
+    },
+    {
+      "entropy": 0.8856981039047241,
+      "epoch": 1.0313075506445673,
+      "grad_norm": 0.7143136858940125,
+      "learning_rate": 9.923803553055937e-05,
+      "loss": 0.08632323145866394,
+      "mean_token_accuracy": 0.9677783191204071,
+      "num_tokens": 11505059.0,
+      "step": 560
+    },
+    {
+      "entropy": 0.8937099635601043,
+      "epoch": 1.0497237569060773,
+      "grad_norm": 0.7751694321632385,
+      "learning_rate": 9.918415928038325e-05,
+      "loss": 0.08178263902664185,
+      "mean_token_accuracy": 0.9694291114807129,
+      "num_tokens": 11710464.0,
+      "step": 570
+    },
+    {
+      "entropy": 0.8858704209327698,
+      "epoch": 1.0681399631675874,
+      "grad_norm": 0.7492292523384094,
+      "learning_rate": 9.912845845831805e-05,
+      "loss": 0.08074211478233337,
+      "mean_token_accuracy": 0.9692470014095307,
+      "num_tokens": 11915959.0,
+      "step": 580
+    },
+    {
+      "entropy": 0.8948039829730987,
+      "epoch": 1.0865561694290977,
+      "grad_norm": 0.8116479516029358,
+      "learning_rate": 9.907093513068259e-05,
+      "loss": 0.08712012171745301,
+      "mean_token_accuracy": 0.9669980227947235,
+      "num_tokens": 12121499.0,
+      "step": 590
+    },
+    {
+      "entropy": 0.8846789538860321,
+      "epoch": 1.1049723756906078,
+      "grad_norm": 0.7295626997947693,
+      "learning_rate": 9.901159143140471e-05,
+      "loss": 0.08444435596466064,
+      "mean_token_accuracy": 0.9674544095993042,
+      "num_tokens": 12327061.0,
+      "step": 600
+    },
+    {
+      "entropy": 0.8734103918075562,
+      "epoch": 1.1233885819521179,
+      "grad_norm": 0.9585768580436707,
+      "learning_rate": 9.89504295619421e-05,
+      "loss": 0.08022565841674804,
+      "mean_token_accuracy": 0.969569206237793,
+      "num_tokens": 12532305.0,
+      "step": 610
+    },
+    {
+      "entropy": 0.8640486001968384,
+      "epoch": 1.141804788213628,
+      "grad_norm": 0.7891159057617188,
+      "learning_rate": 9.88874517912006e-05,
+      "loss": 0.08415375947952271,
+      "mean_token_accuracy": 0.9678892493247986,
+      "num_tokens": 12737828.0,
+      "step": 620
+    },
+    {
+      "entropy": 0.8599755525588989,
+      "epoch": 1.160220994475138,
+      "grad_norm": 0.5801345109939575,
+      "learning_rate": 9.882266045545012e-05,
+      "loss": 0.08100489974021911,
+      "mean_token_accuracy": 0.9688023269176483,
+      "num_tokens": 12943343.0,
+      "step": 630
+    },
+    {
+      "entropy": 0.86524977684021,
+      "epoch": 1.1786372007366483,
+      "grad_norm": 0.7633041143417358,
+      "learning_rate": 9.87560579582379e-05,
+      "loss": 0.07859406471252442,
+      "mean_token_accuracy": 0.9702189445495606,
+      "num_tokens": 13148473.0,
+      "step": 640
+    },
+    {
+      "entropy": 0.8466695249080658,
+      "epoch": 1.1970534069981584,
+      "grad_norm": 0.8672215938568115,
+      "learning_rate": 9.868764677029934e-05,
+      "loss": 0.08082623481750488,
+      "mean_token_accuracy": 0.9689972400665283,
+      "num_tokens": 13353890.0,
+      "step": 650
+    },
+    {
+      "entropy": 0.8596941530704498,
+      "epoch": 1.2154696132596685,
+      "grad_norm": 0.7524124383926392,
+      "learning_rate": 9.861742942946639e-05,
+      "loss": 0.0789935290813446,
+      "mean_token_accuracy": 0.9693858206272126,
+      "num_tokens": 13559475.0,
+      "step": 660
+    },
+    {
+      "entropy": 0.8708749234676361,
+      "epoch": 1.2338858195211786,
+      "grad_norm": 0.5777031183242798,
+      "learning_rate": 9.854540854057337e-05,
+      "loss": 0.07773642539978028,
+      "mean_token_accuracy": 0.970385092496872,
+      "num_tokens": 13765076.0,
+      "step": 670
+    },
+    {
+      "entropy": 0.8651713371276856,
+      "epoch": 1.2523020257826887,
+      "grad_norm": 0.7924166321754456,
+      "learning_rate": 9.847158677536034e-05,
+      "loss": 0.0766686737537384,
+      "mean_token_accuracy": 0.9702267110347748,
+      "num_tokens": 13970642.0,
+      "step": 680
+    },
+    {
+      "entropy": 0.8763024985790253,
+      "epoch": 1.270718232044199,
+      "grad_norm": 0.741219162940979,
+      "learning_rate": 9.839596687237403e-05,
+      "loss": 0.07189929485321045,
+      "mean_token_accuracy": 0.9727097094058991,
+      "num_tokens": 14176556.0,
+      "step": 690
+    },
+    {
+      "entropy": 0.8556921362876893,
+      "epoch": 1.289134438305709,
+      "grad_norm": 0.6298198103904724,
+      "learning_rate": 9.831855163686618e-05,
+      "loss": 0.07608137726783752,
+      "mean_token_accuracy": 0.9716399371623993,
+      "num_tokens": 14381686.0,
+      "step": 700
+    },
+    {
+      "entropy": 0.869178420305252,
+      "epoch": 1.3075506445672191,
+      "grad_norm": 0.5850273370742798,
+      "learning_rate": 9.823934394068952e-05,
+      "loss": 0.07437651753425598,
+      "mean_token_accuracy": 0.9709566533565521,
+      "num_tokens": 14586814.0,
+      "step": 710
+    },
+    {
+      "entropy": 0.8708595156669616,
+      "epoch": 1.3259668508287292,
+      "grad_norm": 0.6580632328987122,
+      "learning_rate": 9.815834672219127e-05,
+      "loss": 0.07518917322158813,
+      "mean_token_accuracy": 0.9717426657676697,
+      "num_tokens": 14792321.0,
+      "step": 720
+    },
+    {
+      "entropy": 0.8826817810535431,
+      "epoch": 1.3443830570902393,
+      "grad_norm": 0.8788532018661499,
+      "learning_rate": 9.807556298610404e-05,
+      "loss": 0.07579240798950196,
+      "mean_token_accuracy": 0.9706341981887817,
+      "num_tokens": 14997810.0,
+      "step": 730
+    },
+    {
+      "entropy": 0.9012470185756684,
+      "epoch": 1.3627992633517496,
+      "grad_norm": 0.7022138237953186,
+      "learning_rate": 9.799099580343441e-05,
+      "loss": 0.0775588572025299,
+      "mean_token_accuracy": 0.9699241399765015,
+      "num_tokens": 15203795.0,
+      "step": 740
+    },
+    {
+      "entropy": 0.886955714225769,
+      "epoch": 1.3812154696132597,
+      "grad_norm": 0.7881133556365967,
+      "learning_rate": 9.790464831134903e-05,
+      "loss": 0.07125020027160645,
+      "mean_token_accuracy": 0.9723815560340882,
+      "num_tokens": 15408974.0,
+      "step": 750
+    },
+    {
+      "entropy": 0.9047374844551086,
+      "epoch": 1.3996316758747698,
+      "grad_norm": 0.9082005023956299,
+      "learning_rate": 9.781652371305824e-05,
+      "loss": 0.07004334926605224,
+      "mean_token_accuracy": 0.9725580036640167,
+      "num_tokens": 15614399.0,
+      "step": 760
+    },
+    {
+      "entropy": 0.9039053857326508,
+      "epoch": 1.4180478821362799,
+      "grad_norm": 0.8060817122459412,
+      "learning_rate": 9.77266252776972e-05,
+      "loss": 0.07103485465049744,
+      "mean_token_accuracy": 0.9721468150615692,
+      "num_tokens": 15819895.0,
+      "step": 770
+    },
+    {
+      "entropy": 0.8998047232627868,
+      "epoch": 1.43646408839779,
+      "grad_norm": 1.0152642726898193,
+      "learning_rate": 9.763495634020467e-05,
+      "loss": 0.07411704063415528,
+      "mean_token_accuracy": 0.9711063146591187,
+      "num_tokens": 16025297.0,
+      "step": 780
+    },
+    {
+      "entropy": 0.9120213568210602,
+      "epoch": 1.4548802946593002,
+      "grad_norm": 0.6288319826126099,
+      "learning_rate": 9.754152030119921e-05,
+      "loss": 0.07223712205886841,
+      "mean_token_accuracy": 0.9722476422786712,
+      "num_tokens": 16230656.0,
+      "step": 790
+    },
+    {
+      "entropy": 0.9142370820045471,
+      "epoch": 1.4732965009208103,
+      "grad_norm": 0.7854700088500977,
+      "learning_rate": 9.744632062685311e-05,
+      "loss": 0.07186744809150696,
+      "mean_token_accuracy": 0.972247713804245,
+      "num_tokens": 16435943.0,
+      "step": 800
+    },
+    {
+      "entropy": 0.8920814216136932,
+      "epoch": 1.4917127071823204,
+      "grad_norm": 0.6227074265480042,
+      "learning_rate": 9.734936084876383e-05,
+      "loss": 0.07016961574554444,
+      "mean_token_accuracy": 0.9725603640079499,
+      "num_tokens": 16641635.0,
+      "step": 810
+    },
+    {
+      "entropy": 0.891328877210617,
+      "epoch": 1.5101289134438307,
+      "grad_norm": 0.7601346969604492,
+      "learning_rate": 9.725064456382283e-05,
+      "loss": 0.07137494087219239,
+      "mean_token_accuracy": 0.9722997546195984,
+      "num_tokens": 16847194.0,
+      "step": 820
+    },
+    {
+      "entropy": 0.8921217978000641,
+      "epoch": 1.5285451197053406,
+      "grad_norm": 0.7813850045204163,
+      "learning_rate": 9.715017543408233e-05,
+      "loss": 0.06890199184417725,
+      "mean_token_accuracy": 0.9735044002532959,
+      "num_tokens": 17052807.0,
+      "step": 830
+    },
+    {
+      "entropy": 0.9085914671421051,
+      "epoch": 1.5469613259668509,
+      "grad_norm": 0.6184289455413818,
+      "learning_rate": 9.704795718661939e-05,
+      "loss": 0.07043765187263488,
+      "mean_token_accuracy": 0.9725716531276702,
+      "num_tokens": 17258284.0,
+      "step": 840
+    },
+    {
+      "entropy": 0.9029861629009247,
+      "epoch": 1.565377532228361,
+      "grad_norm": 0.7082377076148987,
+      "learning_rate": 9.694399361339752e-05,
+      "loss": 0.07113839387893676,
+      "mean_token_accuracy": 0.9725669205188752,
+      "num_tokens": 17464326.0,
+      "step": 850
+    },
+    {
+      "entropy": 0.8856533527374267,
+      "epoch": 1.583793738489871,
+      "grad_norm": 0.7409216165542603,
+      "learning_rate": 9.683828857112627e-05,
+      "loss": 0.07077333331108093,
+      "mean_token_accuracy": 0.9731084644794464,
+      "num_tokens": 17669537.0,
+      "step": 860
+    },
+    {
+      "entropy": 0.8613030433654785,
+      "epoch": 1.6022099447513813,
+      "grad_norm": 0.6801561713218689,
+      "learning_rate": 9.673084598111789e-05,
+      "loss": 0.06885308027267456,
+      "mean_token_accuracy": 0.97266526222229,
+      "num_tokens": 17875289.0,
+      "step": 870
+    },
+    {
+      "entropy": 0.8692965865135193,
+      "epoch": 1.6206261510128912,
+      "grad_norm": 1.1621277332305908,
+      "learning_rate": 9.662166982914203e-05,
+      "loss": 0.07017780542373657,
+      "mean_token_accuracy": 0.9733059942722321,
+      "num_tokens": 18080404.0,
+      "step": 880
+    },
+    {
+      "entropy": 0.8671502113342285,
+      "epoch": 1.6390423572744015,
+      "grad_norm": 0.7518903613090515,
+      "learning_rate": 9.651076416527787e-05,
+      "loss": 0.06977018713951111,
+      "mean_token_accuracy": 0.9730017304420471,
+      "num_tokens": 18285699.0,
+      "step": 890
+    },
+    {
+      "entropy": 0.8662045657634735,
+      "epoch": 1.6574585635359116,
+      "grad_norm": 0.6622698903083801,
+      "learning_rate": 9.639813310376378e-05,
+      "loss": 0.06620995998382569,
+      "mean_token_accuracy": 0.9737491130828857,
+      "num_tokens": 18491097.0,
+      "step": 900
+    },
+    {
+      "entropy": 0.8548173069953918,
+      "epoch": 1.6758747697974217,
+      "grad_norm": 0.8941843509674072,
+      "learning_rate": 9.628378082284479e-05,
+      "loss": 0.06711119413375854,
+      "mean_token_accuracy": 0.9740589797496796,
+      "num_tokens": 18696827.0,
+      "step": 910
+    },
+    {
+      "entropy": 0.8763562262058258,
+      "epoch": 1.694290976058932,
+      "grad_norm": 0.7571700215339661,
+      "learning_rate": 9.616771156461755e-05,
+      "loss": 0.07263468503952027,
+      "mean_token_accuracy": 0.9717419981956482,
+      "num_tokens": 18902513.0,
+      "step": 920
+    },
+    {
+      "entropy": 0.8663733780384064,
+      "epoch": 1.7127071823204418,
+      "grad_norm": 0.7886489629745483,
+      "learning_rate": 9.604992963487298e-05,
+      "loss": 0.07074605226516724,
+      "mean_token_accuracy": 0.9724965393543243,
+      "num_tokens": 19107812.0,
+      "step": 930
+    },
+    {
+      "entropy": 0.8673004627227783,
+      "epoch": 1.7311233885819521,
+      "grad_norm": 0.8180726170539856,
+      "learning_rate": 9.593043940293647e-05,
+      "loss": 0.06831735372543335,
+      "mean_token_accuracy": 0.9733696818351746,
+      "num_tokens": 19313330.0,
+      "step": 940
+    },
+    {
+      "entropy": 0.8525971233844757,
+      "epoch": 1.7495395948434622,
+      "grad_norm": 0.6576228737831116,
+      "learning_rate": 9.580924530150595e-05,
+      "loss": 0.06567002534866333,
+      "mean_token_accuracy": 0.9745754361152649,
+      "num_tokens": 19518671.0,
+      "step": 950
+    },
+    {
+      "entropy": 0.8605451703071594,
+      "epoch": 1.7679558011049723,
+      "grad_norm": 0.7171661257743835,
+      "learning_rate": 9.568635182648725e-05,
+      "loss": 0.06872050762176514,
+      "mean_token_accuracy": 0.9732091546058654,
+      "num_tokens": 19724135.0,
+      "step": 960
+    },
+    {
+      "entropy": 0.8642210960388184,
+      "epoch": 1.7863720073664826,
+      "grad_norm": 0.7603147029876709,
+      "learning_rate": 9.556176353682746e-05,
+      "loss": 0.06766576766967773,
+      "mean_token_accuracy": 0.9728681743144989,
+      "num_tokens": 19928785.0,
+      "step": 970
+    },
+    {
+      "entropy": 0.8543185651302337,
+      "epoch": 1.8047882136279927,
+      "grad_norm": 0.7280875444412231,
+      "learning_rate": 9.543548505434581e-05,
+      "loss": 0.06851862668991089,
+      "mean_token_accuracy": 0.9737437188625335,
+      "num_tokens": 20134195.0,
+      "step": 980
+    },
+    {
+      "entropy": 0.8744745373725891,
+      "epoch": 1.8232044198895028,
+      "grad_norm": 0.5897248983383179,
+      "learning_rate": 9.530752106356209e-05,
+      "loss": 0.06809053421020508,
+      "mean_token_accuracy": 0.9733593761920929,
+      "num_tokens": 20339517.0,
+      "step": 990
+    },
+    {
+      "entropy": 0.8623859465122223,
+      "epoch": 1.8416206261510129,
+      "grad_norm": 0.7515265345573425,
+      "learning_rate": 9.517787631152298e-05,
+      "loss": 0.07257847785949707,
+      "mean_token_accuracy": 0.9714054942131043,
+      "num_tokens": 20545249.0,
+      "step": 1000
+    },
+    {
+      "entropy": 0.8669404804706573,
+      "epoch": 1.860036832412523,
+      "grad_norm": 0.7144560813903809,
+      "learning_rate": 9.504655560762596e-05,
+      "loss": 0.06832354068756104,
+      "mean_token_accuracy": 0.9735779523849487,
+      "num_tokens": 20750507.0,
+      "step": 1010
+    },
+    {
+      "entropy": 0.8493516445159912,
+      "epoch": 1.8784530386740332,
+      "grad_norm": 0.6559189558029175,
+      "learning_rate": 9.491356382344081e-05,
+      "loss": 0.0629766047000885,
+      "mean_token_accuracy": 0.9754977762699127,
+      "num_tokens": 20955956.0,
+      "step": 1020
+    },
+    {
+      "entropy": 0.8599376022815705,
+      "epoch": 1.8968692449355433,
+      "grad_norm": 0.6792973279953003,
+      "learning_rate": 9.477890589252895e-05,
+      "loss": 0.0666757881641388,
+      "mean_token_accuracy": 0.974083811044693,
+      "num_tokens": 21161163.0,
+      "step": 1030
+    },
+    {
+      "entropy": 0.8458438158035279,
+      "epoch": 1.9152854511970534,
+      "grad_norm": 0.6941778659820557,
+      "learning_rate": 9.464258681026042e-05,
+      "loss": 0.06307152509689332,
+      "mean_token_accuracy": 0.9757042229175568,
+      "num_tokens": 21366525.0,
+      "step": 1040
+    },
+    {
+      "entropy": 0.848515909910202,
+      "epoch": 1.9337016574585635,
+      "grad_norm": 0.7307806611061096,
+      "learning_rate": 9.450461163362855e-05,
+      "loss": 0.06307026147842407,
+      "mean_token_accuracy": 0.9750974595546722,
+      "num_tokens": 21572238.0,
+      "step": 1050
+    },
+    {
+      "entropy": 0.8563454031944275,
+      "epoch": 1.9521178637200736,
+      "grad_norm": 0.7222106456756592,
+      "learning_rate": 9.436498548106236e-05,
+      "loss": 0.0647726058959961,
+      "mean_token_accuracy": 0.974629694223404,
+      "num_tokens": 21777633.0,
+      "step": 1060
+    },
+    {
+      "entropy": 0.8656457483768463,
+      "epoch": 1.9705340699815839,
+      "grad_norm": 0.67178875207901,
+      "learning_rate": 9.422371353223674e-05,
+      "loss": 0.06573554277420043,
+      "mean_token_accuracy": 0.9745908617973328,
+      "num_tokens": 21983116.0,
+      "step": 1070
+    },
+    {
+      "entropy": 0.8630891263484954,
+      "epoch": 1.988950276243094,
+      "grad_norm": 0.6956593990325928,
+      "learning_rate": 9.408080102788016e-05,
+      "loss": 0.06630704402923585,
+      "mean_token_accuracy": 0.9741333484649658,
+      "num_tokens": 22188662.0,
+      "step": 1080
+    },
+    {
+      "epoch": 2.0,
+      "eval_entropy": 0.8560857042022373,
+      "eval_loss": 0.06494329869747162,
+      "eval_mean_token_accuracy": 0.9745692672936813,
+      "eval_num_tokens": 22311800.0,
+      "eval_runtime": 10.129,
+      "eval_samples_per_second": 361.142,
+      "eval_steps_per_second": 11.354,
+      "step": 1086
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 5430,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 10,
+  "save_steps": 500,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 1.0639691635941704e+18,
+  "train_batch_size": 32,
+  "trial_name": null,
+  "trial_params": null
+}
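
Note on the logged learning rates: the `learning_rate` values in the log above appear consistent with a linear warmup followed by cosine decay. The sketch below tries to reproduce them. The peak of 1e-4 and the 272 warmup steps are not stated anywhere in this diff; they are assumptions inferred from the logged slope (about 3.68e-07 per step) and the plateau near step 280. `max_steps` is taken from the file. The off-by-one (`logged_step - 1`) is also an inference from the numbers, not something the trainer state records.

```python
import math

PEAK_LR = 1e-4       # assumption: inferred from the plateau around step 280
WARMUP_STEPS = 272   # assumption: inferred from the warmup slope in the log
MAX_STEPS = 5430     # "max_steps" in trainer_state.json


def lr_at(logged_step: int) -> float:
    """Estimate the learning rate reported in the log entry for `logged_step`."""
    # The logged values line up with the schedule evaluated one step earlier.
    s = logged_step - 1
    if s < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR.
        return PEAK_LR * s / WARMUP_STEPS
    # Cosine decay from PEAK_LR towards 0 over the remaining steps.
    progress = (s - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (240, 270, 280, 540, 1080):
        print(step, lr_at(step))
```

If the assumptions hold, the same function should also predict the rates for later checkpoints; a mismatch would suggest a different scheduler or warmup length.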

checkpoint-1086/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
+size 5777

checkpoint-1629/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-1629/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
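
As a reading aid for the config above, the following minimal sketch computes the factor by which the low-rank update is scaled. It assumes the standard LoRA convention (`lora_alpha / r`, or `lora_alpha / sqrt(r)` when `use_rslora` is true). The helper name `lora_scaling` is hypothetical, and only a subset of the config fields is reproduced.

```python
import json

# Subset of the adapter_config.json fields from this checkpoint.
ADAPTER_CONFIG = json.loads("""
{
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "r": 8,
  "use_rslora": false,
  "target_modules": ["o_proj", "v_proj", "down_proj", "q_proj",
                     "gate_proj", "k_proj", "up_proj"]
}
""")


def lora_scaling(config: dict) -> float:
    """Hypothetical helper: factor applied to the low-rank update B @ A."""
    rank = config["r"]
    alpha = config["lora_alpha"]
    if config.get("use_rslora", False):
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return alpha / rank ** 0.5
    return alpha / rank


scaling = lora_scaling(ADAPTER_CONFIG)
print(scaling)  # 4.0
```

With `r` = 8 and `lora_alpha` = 32, each adapted projection adds its low-rank update multiplied by 4.0. The seven `target_modules` cover every attention and MLP projection in a Qwen2-style decoder block.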

checkpoint-1629/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ebf4459fd4ce731043eb056554bc1c81afea162e43afc91a4e21906da57bbdc0
+size 80792096
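
The pointer size can be sanity-checked against the adapter config with a rough parameter count. The sketch below assumes the usual Qwen2.5-7B dimensions (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of size 128). None of these appear in this diff, so treat them as assumptions. The helper name `lora_param_count` is hypothetical.

```python
# Assumed Qwen2.5-7B dimensions (not stated anywhere in this diff).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads, head size 128

RANK = 8              # "r" in adapter_config.json
FILE_SIZE = 80792096  # bytes, from the Git LFS pointer

# (in_features, out_features) for each targeted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int, projections: dict, num_layers: int) -> int:
    """Hypothetical helper: A is (rank x in), B is (out x rank) per projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_layer * num_layers


params = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
print(params)              # 20185088
print(FILE_SIZE / params)  # bytes per parameter, close to 4
```

Under these assumptions the adapter holds about 20.2M parameters. At roughly 4 bytes each, the file size is consistent with float32 storage, with the remainder attributable to the safetensors header.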

checkpoint-1629/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
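
To make the template's output concrete, here is a minimal Python sketch of its no-tools path for plain text messages. It is an approximation for illustration, not the Jinja renderer that Transformers actually uses. The function name `render_chatml` is hypothetical, and tool calls and tool responses are deliberately left out.

```python
DEFAULT_SYSTEM = (
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
)


def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper mirroring the template's no-tools branch."""
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        system_text = messages[0]["content"]
        rest = messages[1:]
    else:
        system_text = DEFAULT_SYSTEM
        rest = messages

    parts = [f"<|im_start|>system\n{system_text}<|im_end|>\n"]
    for message in rest:
        # Covers user, assistant (without tool calls), and later system turns.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

Because no system message is supplied here, the default Qwen system prompt is emitted first, followed by the user turn and an open assistant turn.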

checkpoint-1629/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

checkpoint-1629/tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

checkpoint-1629/trainer_state.json ADDED

	@@ -0,0 +1,1687 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 3.0,
+  "eval_steps": 500,
+  "global_step": 1629,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "entropy": 1.2237394809722901,
+      "epoch": 0.01841620626151013,
+      "grad_norm": 5.082435607910156,
+      "learning_rate": 3.308823529411765e-06,
+      "loss": 0.9237876892089844,
+      "mean_token_accuracy": 0.7685343027114868,
+      "num_tokens": 205423.0,
+      "step": 10
+    },
+    {
+      "entropy": 1.2295925617218018,
+      "epoch": 0.03683241252302026,
+      "grad_norm": 4.672000408172607,
+      "learning_rate": 6.985294117647059e-06,
+      "loss": 0.8900892257690429,
+      "mean_token_accuracy": 0.7677771031856537,
+      "num_tokens": 410849.0,
+      "step": 20
+    },
+    {
+      "entropy": 1.2285718679428101,
+      "epoch": 0.055248618784530384,
+      "grad_norm": 1.4828118085861206,
+      "learning_rate": 1.0661764705882354e-05,
+      "loss": 0.5975452899932862,
+      "mean_token_accuracy": 0.8146551787853241,
+      "num_tokens": 616438.0,
+      "step": 30
+    },
+    {
+      "entropy": 1.210776400566101,
+      "epoch": 0.07366482504604052,
+      "grad_norm": 0.7761328816413879,
+      "learning_rate": 1.4338235294117647e-05,
+      "loss": 0.40664992332458494,
+      "mean_token_accuracy": 0.8699092030525207,
+      "num_tokens": 822118.0,
+      "step": 40
+    },
+    {
+      "entropy": 1.200321125984192,
+      "epoch": 0.09208103130755065,
+      "grad_norm": 0.5363371968269348,
+      "learning_rate": 1.8014705882352943e-05,
+      "loss": 0.3313469409942627,
+      "mean_token_accuracy": 0.8904915869235992,
+      "num_tokens": 1027941.0,
+      "step": 50
+    },
+    {
+      "entropy": 1.1809936046600342,
+      "epoch": 0.11049723756906077,
+      "grad_norm": 0.39541518688201904,
+      "learning_rate": 2.1691176470588237e-05,
+      "loss": 0.27568228244781495,
+      "mean_token_accuracy": 0.9047131836414337,
+      "num_tokens": 1233620.0,
+      "step": 60
+    },
+    {
+      "entropy": 1.169810914993286,
+      "epoch": 0.1289134438305709,
+      "grad_norm": 0.341960072517395,
+      "learning_rate": 2.536764705882353e-05,
+      "loss": 0.245219087600708,
+      "mean_token_accuracy": 0.9150686681270599,
+      "num_tokens": 1438656.0,
+      "step": 70
+    },
+    {
+      "entropy": 1.1652960777282715,
+      "epoch": 0.14732965009208104,
+      "grad_norm": 0.36872178316116333,
+      "learning_rate": 2.9044117647058828e-05,
+      "loss": 0.2220149040222168,
+      "mean_token_accuracy": 0.9224777698516846,
+      "num_tokens": 1643877.0,
+      "step": 80
+    },
+    {
+      "entropy": 1.154341197013855,
+      "epoch": 0.16574585635359115,
+      "grad_norm": 0.4152425229549408,
+      "learning_rate": 3.272058823529412e-05,
+      "loss": 0.2002798557281494,
+      "mean_token_accuracy": 0.9285802960395813,
+      "num_tokens": 1849506.0,
+      "step": 90
+    },
+    {
+      "entropy": 1.1507258892059327,
+      "epoch": 0.1841620626151013,
+      "grad_norm": 0.47647765278816223,
+      "learning_rate": 3.639705882352941e-05,
+      "loss": 0.18871363401412963,
+      "mean_token_accuracy": 0.9318056285381318,
+      "num_tokens": 2055071.0,
+      "step": 100
+    },
+    {
+      "entropy": 1.1455535531044005,
+      "epoch": 0.20257826887661143,
+      "grad_norm": 0.4853009581565857,
+      "learning_rate": 4.007352941176471e-05,
+      "loss": 0.17836341857910157,
+      "mean_token_accuracy": 0.9367631554603577,
+      "num_tokens": 2260643.0,
+      "step": 110
+    },
+    {
+      "entropy": 1.1402526497840881,
+      "epoch": 0.22099447513812154,
+      "grad_norm": 0.4455392360687256,
+      "learning_rate": 4.375e-05,
+      "loss": 0.16921783685684205,
+      "mean_token_accuracy": 0.9386959195137023,
+      "num_tokens": 2466085.0,
+      "step": 120
+    },
+    {
+      "entropy": 1.1374777555465698,
+      "epoch": 0.23941068139963168,
+      "grad_norm": 0.5880279541015625,
+      "learning_rate": 4.742647058823529e-05,
+      "loss": 0.15989291667938232,
+      "mean_token_accuracy": 0.9421182632446289,
+      "num_tokens": 2671024.0,
+      "step": 130
+    },
+    {
+      "entropy": 1.1273940205574036,
+      "epoch": 0.2578268876611418,
+      "grad_norm": 0.612959086894989,
+      "learning_rate": 5.110294117647059e-05,
+      "loss": 0.14701461791992188,
+      "mean_token_accuracy": 0.9463540315628052,
+      "num_tokens": 2876848.0,
+      "step": 140
+    },
+    {
+      "entropy": 1.1263513088226318,
+      "epoch": 0.27624309392265195,
+      "grad_norm": 0.5695255398750305,
+      "learning_rate": 5.477941176470589e-05,
+      "loss": 0.14604382514953612,
+      "mean_token_accuracy": 0.946351945400238,
+      "num_tokens": 3082589.0,
+      "step": 150
+    },
+    {
+      "entropy": 1.1290789365768432,
+      "epoch": 0.2946593001841621,
+      "grad_norm": 0.6608090996742249,
+      "learning_rate": 5.845588235294118e-05,
+      "loss": 0.1409450054168701,
+      "mean_token_accuracy": 0.9481450319290161,
+      "num_tokens": 3287459.0,
+      "step": 160
+    },
+    {
+      "entropy": 1.1291529774665832,
+      "epoch": 0.31307550644567217,
+      "grad_norm": 0.652715802192688,
+      "learning_rate": 6.213235294117647e-05,
+      "loss": 0.14441155195236205,
+      "mean_token_accuracy": 0.9466125547885895,
+      "num_tokens": 3493682.0,
+      "step": 170
+    },
+    {
+      "entropy": 1.1244838953018188,
+      "epoch": 0.3314917127071823,
+      "grad_norm": 0.7815241813659668,
+      "learning_rate": 6.580882352941177e-05,
+      "loss": 0.13361064195632935,
+      "mean_token_accuracy": 0.9512295544147491,
+      "num_tokens": 3699573.0,
+      "step": 180
+    },
+    {
+      "entropy": 1.1217721104621887,
+      "epoch": 0.34990791896869244,
+      "grad_norm": 0.7933160066604614,
+      "learning_rate": 6.948529411764706e-05,
+      "loss": 0.13089522123336791,
+      "mean_token_accuracy": 0.9520221531391144,
+      "num_tokens": 3905156.0,
+      "step": 190
+    },
+    {
+      "entropy": 1.1206679105758668,
+      "epoch": 0.3683241252302026,
+      "grad_norm": 0.6815240383148193,
+      "learning_rate": 7.316176470588236e-05,
+      "loss": 0.13400404453277587,
+      "mean_token_accuracy": 0.9501322209835052,
+      "num_tokens": 4110570.0,
+      "step": 200
+    },
+    {
+      "entropy": 1.1161052227020263,
+      "epoch": 0.3867403314917127,
+      "grad_norm": 0.8297767639160156,
+      "learning_rate": 7.683823529411766e-05,
+      "loss": 0.13389937877655028,
+      "mean_token_accuracy": 0.9501932203769684,
+      "num_tokens": 4315834.0,
+      "step": 210
+    },
+    {
+      "entropy": 1.1098745942115784,
+      "epoch": 0.40515653775322286,
+      "grad_norm": 0.5943381786346436,
+      "learning_rate": 8.051470588235294e-05,
+      "loss": 0.13452907800674438,
+      "mean_token_accuracy": 0.9503286242485046,
+      "num_tokens": 4520807.0,
+      "step": 220
+    },
+    {
+      "entropy": 1.100480353832245,
+      "epoch": 0.42357274401473294,
+      "grad_norm": 0.6094359755516052,
+      "learning_rate": 8.419117647058824e-05,
+      "loss": 0.12827746868133544,
+      "mean_token_accuracy": 0.952492094039917,
+      "num_tokens": 4725867.0,
+      "step": 230
+    },
+    {
+      "entropy": 1.0901286959648133,
+      "epoch": 0.4419889502762431,
+      "grad_norm": 0.7240597605705261,
+      "learning_rate": 8.786764705882353e-05,
+      "loss": 0.12171242237091065,
+      "mean_token_accuracy": 0.953943532705307,
+      "num_tokens": 4931629.0,
+      "step": 240
+    },
+    {
+      "entropy": 1.0885071873664856,
+      "epoch": 0.4604051565377532,
+      "grad_norm": 0.6939547657966614,
+      "learning_rate": 9.154411764705882e-05,
+      "loss": 0.12155698537826538,
+      "mean_token_accuracy": 0.9545870959758759,
+      "num_tokens": 5137285.0,
+      "step": 250
+    },
+    {
+      "entropy": 1.086272156238556,
+      "epoch": 0.47882136279926335,
+      "grad_norm": 0.5752800703048706,
+      "learning_rate": 9.522058823529412e-05,
+      "loss": 0.12157790660858155,
+      "mean_token_accuracy": 0.9541126549243927,
+      "num_tokens": 5342575.0,
+      "step": 260
+    },
+    {
+      "entropy": 1.0857678413391114,
+      "epoch": 0.4972375690607735,
+      "grad_norm": 0.7565123438835144,
+      "learning_rate": 9.889705882352942e-05,
+      "loss": 0.12349612712860107,
+      "mean_token_accuracy": 0.9535140514373779,
+      "num_tokens": 5547995.0,
+      "step": 270
+    },
+    {
+      "entropy": 1.079762625694275,
+      "epoch": 0.5156537753222836,
+      "grad_norm": 0.6972768306732178,
+      "learning_rate": 9.999954556423843e-05,
+      "loss": 0.11875582933425903,
+      "mean_token_accuracy": 0.9556483089923858,
+      "num_tokens": 5753195.0,
+      "step": 280
+    },
+    {
+      "entropy": 1.0742079138755798,
+      "epoch": 0.5340699815837937,
+      "grad_norm": 0.7821696996688843,
+      "learning_rate": 9.999731977631227e-05,
+      "loss": 0.11824090480804443,
+      "mean_token_accuracy": 0.9557521045207977,
+      "num_tokens": 5958236.0,
+      "step": 290
+    },
+    {
+      "entropy": 1.0679773569107056,
+      "epoch": 0.5524861878453039,
+      "grad_norm": 0.5846888422966003,
+      "learning_rate": 9.999323925089486e-05,
+      "loss": 0.11707355976104736,
+      "mean_token_accuracy": 0.9554719448089599,
+      "num_tokens": 6163992.0,
+      "step": 300
+    },
+    {
+      "entropy": 1.0655727863311768,
+      "epoch": 0.570902394106814,
+      "grad_norm": 0.5812502503395081,
+      "learning_rate": 9.998730413936037e-05,
+      "loss": 0.11371417045593261,
+      "mean_token_accuracy": 0.9576376020908356,
+      "num_tokens": 6369456.0,
+      "step": 310
+    },
+    {
+      "entropy": 1.0607039332389832,
+      "epoch": 0.5893186003683242,
+      "grad_norm": 0.6238475441932678,
+      "learning_rate": 9.99795146618821e-05,
+      "loss": 0.11775733232498169,
+      "mean_token_accuracy": 0.9557221591472626,
+      "num_tokens": 6574833.0,
+      "step": 320
+    },
+    {
+      "entropy": 1.0504255175590516,
+      "epoch": 0.6077348066298343,
+      "grad_norm": 0.6496815085411072,
+      "learning_rate": 9.996987110742422e-05,
+      "loss": 0.10904088020324706,
+      "mean_token_accuracy": 0.9585366368293762,
+      "num_tokens": 6780108.0,
+      "step": 330
+    },
+    {
+      "entropy": 1.0456081986427308,
+      "epoch": 0.6261510128913443,
+      "grad_norm": 0.786702573299408,
+      "learning_rate": 9.995837383373119e-05,
+      "loss": 0.10642309188842773,
+      "mean_token_accuracy": 0.9596696078777314,
+      "num_tokens": 6985920.0,
+      "step": 340
+    },
+    {
+      "entropy": 1.0455098271369934,
+      "epoch": 0.6445672191528545,
+      "grad_norm": 0.5473790168762207,
+      "learning_rate": 9.994502326731434e-05,
+      "loss": 0.10822961330413819,
+      "mean_token_accuracy": 0.959563136100769,
+      "num_tokens": 7191465.0,
+      "step": 350
+    },
+    {
+      "entropy": 1.04240562915802,
+      "epoch": 0.6629834254143646,
+      "grad_norm": 0.6672356128692627,
+      "learning_rate": 9.992981990343614e-05,
+      "loss": 0.1110004186630249,
+      "mean_token_accuracy": 0.9582514643669129,
+      "num_tokens": 7396877.0,
+      "step": 360
+    },
+    {
+      "entropy": 1.0386811256408692,
+      "epoch": 0.6813996316758748,
+      "grad_norm": 0.698539674282074,
+      "learning_rate": 9.99127643060918e-05,
+      "loss": 0.107539963722229,
+      "mean_token_accuracy": 0.9593036234378814,
+      "num_tokens": 7602437.0,
+      "step": 370
+    },
+    {
+      "entropy": 1.0311225533485413,
+      "epoch": 0.6998158379373849,
+      "grad_norm": 0.6629284024238586,
+      "learning_rate": 9.989385710798837e-05,
+      "loss": 0.1064023494720459,
+      "mean_token_accuracy": 0.9602205216884613,
+      "num_tokens": 7808142.0,
+      "step": 380
+    },
+    {
+      "entropy": 1.030210506916046,
+      "epoch": 0.7182320441988951,
+      "grad_norm": 0.5616748929023743,
+      "learning_rate": 9.987309901052121e-05,
+      "loss": 0.10717041492462158,
+      "mean_token_accuracy": 0.9599347949028015,
+      "num_tokens": 8013407.0,
+      "step": 390
+    },
+    {
+      "entropy": 1.0208017826080322,
+      "epoch": 0.7366482504604052,
+      "grad_norm": 0.6329049468040466,
+      "learning_rate": 9.985049078374806e-05,
+      "loss": 0.10359601974487305,
+      "mean_token_accuracy": 0.9603756129741668,
+      "num_tokens": 8219040.0,
+      "step": 400
+    },
+    {
+      "entropy": 1.015640377998352,
+      "epoch": 0.7550644567219152,
+      "grad_norm": 0.6516013741493225,
+      "learning_rate": 9.982603326636037e-05,
+      "loss": 0.10146439075469971,
+      "mean_token_accuracy": 0.9627702474594116,
+      "num_tokens": 8424678.0,
+      "step": 410
+    },
+    {
+      "entropy": 1.0105359435081482,
+      "epoch": 0.7734806629834254,
+      "grad_norm": 0.6920603513717651,
+      "learning_rate": 9.979972736565226e-05,
+      "loss": 0.10770498514175415,
+      "mean_token_accuracy": 0.9591470420360565,
+      "num_tokens": 8629868.0,
+      "step": 420
+    },
+    {
+      "entropy": 0.9966452836990356,
+      "epoch": 0.7918968692449355,
+      "grad_norm": 0.6857476234436035,
+      "learning_rate": 9.977157405748687e-05,
+      "loss": 0.10282524824142455,
+      "mean_token_accuracy": 0.9612209022045135,
+      "num_tokens": 8835320.0,
+      "step": 430
+    },
+    {
+      "entropy": 0.9945534646511078,
+      "epoch": 0.8103130755064457,
+      "grad_norm": 0.7208472490310669,
+      "learning_rate": 9.974157438626008e-05,
+      "loss": 0.10069938898086547,
+      "mean_token_accuracy": 0.9620070576667785,
+      "num_tokens": 9041123.0,
+      "step": 440
+    },
+    {
+      "entropy": 0.979461395740509,
+      "epoch": 0.8287292817679558,
+      "grad_norm": 0.5071915984153748,
+      "learning_rate": 9.970972946486185e-05,
+      "loss": 0.09799174070358277,
+      "mean_token_accuracy": 0.9620374023914338,
+      "num_tokens": 9246361.0,
+      "step": 450
+    },
+    {
+      "entropy": 0.9830998003482818,
+      "epoch": 0.8471454880294659,
+      "grad_norm": 0.8660802245140076,
+      "learning_rate": 9.967604047463493e-05,
+      "loss": 0.10378165245056152,
+      "mean_token_accuracy": 0.9606865763664245,
+      "num_tokens": 9451845.0,
+      "step": 460
+    },
+    {
+      "entropy": 0.9813413023948669,
+      "epoch": 0.8655616942909761,
+      "grad_norm": 0.7642477750778198,
+      "learning_rate": 9.964050866533094e-05,
+      "loss": 0.1010061264038086,
+      "mean_token_accuracy": 0.9608745336532593,
+      "num_tokens": 9656802.0,
+      "step": 470
+    },
+    {
+      "entropy": 0.967874163389206,
+      "epoch": 0.8839779005524862,
+      "grad_norm": 0.5987281799316406,
+      "learning_rate": 9.960313535506411e-05,
+      "loss": 0.10169394016265869,
+      "mean_token_accuracy": 0.9611998200416565,
+      "num_tokens": 9861719.0,
+      "step": 480
+    },
+    {
+      "entropy": 0.9663491308689117,
+      "epoch": 0.9023941068139963,
+      "grad_norm": 0.6124638319015503,
+      "learning_rate": 9.956392193026239e-05,
+      "loss": 0.102389657497406,
+      "mean_token_accuracy": 0.9611884355545044,
+      "num_tokens": 10066673.0,
+      "step": 490
+    },
+    {
+      "entropy": 0.959654438495636,
+      "epoch": 0.9208103130755064,
+      "grad_norm": 0.7873051762580872,
+      "learning_rate": 9.952286984561592e-05,
+      "loss": 0.10170392990112305,
+      "mean_token_accuracy": 0.9610928475856781,
+      "num_tokens": 10272091.0,
+      "step": 500
+    },
+    {
+      "entropy": 0.9550537407398224,
+      "epoch": 0.9392265193370166,
+      "grad_norm": 0.6071968078613281,
+      "learning_rate": 9.947998062402313e-05,
+      "loss": 0.09448277950286865,
+      "mean_token_accuracy": 0.9648977637290954,
+      "num_tokens": 10477632.0,
+      "step": 510
+    },
+    {
+      "entropy": 0.9538533687591553,
+      "epoch": 0.9576427255985267,
+      "grad_norm": 0.6317242980003357,
+      "learning_rate": 9.943525585653428e-05,
+      "loss": 0.09542192220687866,
+      "mean_token_accuracy": 0.9635261118412017,
+      "num_tokens": 10682828.0,
+      "step": 520
+    },
+    {
+      "entropy": 0.9362513542175293,
+      "epoch": 0.9760589318600368,
+      "grad_norm": 0.6421944499015808,
+      "learning_rate": 9.938869720229234e-05,
+      "loss": 0.09382058382034301,
+      "mean_token_accuracy": 0.9648073971271515,
+      "num_tokens": 10888741.0,
+      "step": 530
+    },
+    {
+      "entropy": 0.9235438346862793,
+      "epoch": 0.994475138121547,
+      "grad_norm": 0.7986873388290405,
+      "learning_rate": 9.934030638847155e-05,
+      "loss": 0.09827429056167603,
+      "mean_token_accuracy": 0.9621128737926483,
+      "num_tokens": 11094387.0,
+      "step": 540
+    },
+    {
+      "epoch": 1.0,
+      "eval_entropy": 0.9137652366057686,
+      "eval_loss": 0.09368764609098434,
+      "eval_mean_token_accuracy": 0.9640816880309063,
+      "eval_num_tokens": 11155908.0,
+      "eval_runtime": 10.4701,
+      "eval_samples_per_second": 349.377,
+      "eval_steps_per_second": 10.984,
+      "step": 543
+    },
+    {
+      "entropy": 0.9047818422317505,
+      "epoch": 1.0128913443830572,
+      "grad_norm": 0.6781501173973083,
+      "learning_rate": 9.929008521021325e-05,
+      "loss": 0.0863916516304016,
+      "mean_token_accuracy": 0.9673655688762665,
+      "num_tokens": 11299715.0,
+      "step": 550
+    },
+    {
+      "entropy": 0.8856981039047241,
+      "epoch": 1.0313075506445673,
+      "grad_norm": 0.7143136858940125,
+      "learning_rate": 9.923803553055937e-05,
+      "loss": 0.08632323145866394,
+      "mean_token_accuracy": 0.9677783191204071,
+      "num_tokens": 11505059.0,
+      "step": 560
+    },
+    {
+      "entropy": 0.8937099635601043,
+      "epoch": 1.0497237569060773,
+      "grad_norm": 0.7751694321632385,
+      "learning_rate": 9.918415928038325e-05,
+      "loss": 0.08178263902664185,
+      "mean_token_accuracy": 0.9694291114807129,
+      "num_tokens": 11710464.0,
+      "step": 570
+    },
+    {
+      "entropy": 0.8858704209327698,
+      "epoch": 1.0681399631675874,
+      "grad_norm": 0.7492292523384094,
+      "learning_rate": 9.912845845831805e-05,
+      "loss": 0.08074211478233337,
+      "mean_token_accuracy": 0.9692470014095307,
+      "num_tokens": 11915959.0,
+      "step": 580
+    },
+    {
+      "entropy": 0.8948039829730987,
+      "epoch": 1.0865561694290977,
+      "grad_norm": 0.8116479516029358,
+      "learning_rate": 9.907093513068259e-05,
+      "loss": 0.08712012171745301,
+      "mean_token_accuracy": 0.9669980227947235,
+      "num_tokens": 12121499.0,
+      "step": 590
+    },
+    {
+      "entropy": 0.8846789538860321,
+      "epoch": 1.1049723756906078,
+      "grad_norm": 0.7295626997947693,
+      "learning_rate": 9.901159143140471e-05,
+      "loss": 0.08444435596466064,
+      "mean_token_accuracy": 0.9674544095993042,
+      "num_tokens": 12327061.0,
+      "step": 600
+    },
+    {
+      "entropy": 0.8734103918075562,
+      "epoch": 1.1233885819521179,
+      "grad_norm": 0.9585768580436707,
+      "learning_rate": 9.89504295619421e-05,
+      "loss": 0.08022565841674804,
+      "mean_token_accuracy": 0.969569206237793,
+      "num_tokens": 12532305.0,
+      "step": 610
+    },
+    {
+      "entropy": 0.8640486001968384,
+      "epoch": 1.141804788213628,
+      "grad_norm": 0.7891159057617188,
+      "learning_rate": 9.88874517912006e-05,
+      "loss": 0.08415375947952271,
+      "mean_token_accuracy": 0.9678892493247986,
+      "num_tokens": 12737828.0,
+      "step": 620
+    },
+    {
+      "entropy": 0.8599755525588989,
+      "epoch": 1.160220994475138,
+      "grad_norm": 0.5801345109939575,
+      "learning_rate": 9.882266045545012e-05,
+      "loss": 0.08100489974021911,
+      "mean_token_accuracy": 0.9688023269176483,
+      "num_tokens": 12943343.0,
+      "step": 630
+    },
+    {
+      "entropy": 0.86524977684021,
+      "epoch": 1.1786372007366483,
+      "grad_norm": 0.7633041143417358,
+      "learning_rate": 9.87560579582379e-05,
+      "loss": 0.07859406471252442,
+      "mean_token_accuracy": 0.9702189445495606,
+      "num_tokens": 13148473.0,
+      "step": 640
+    },
+    {
+      "entropy": 0.8466695249080658,
+      "epoch": 1.1970534069981584,
+      "grad_norm": 0.8672215938568115,
+      "learning_rate": 9.868764677029934e-05,
+      "loss": 0.08082623481750488,
+      "mean_token_accuracy": 0.9689972400665283,
+      "num_tokens": 13353890.0,
+      "step": 650
+    },
+    {
+      "entropy": 0.8596941530704498,
+      "epoch": 1.2154696132596685,
+      "grad_norm": 0.7524124383926392,
+      "learning_rate": 9.861742942946639e-05,
+      "loss": 0.0789935290813446,
+      "mean_token_accuracy": 0.9693858206272126,
+      "num_tokens": 13559475.0,
+      "step": 660
+    },
+    {
+      "entropy": 0.8708749234676361,
+      "epoch": 1.2338858195211786,
+      "grad_norm": 0.5777031183242798,
+      "learning_rate": 9.854540854057337e-05,
+      "loss": 0.07773642539978028,
+      "mean_token_accuracy": 0.970385092496872,
+      "num_tokens": 13765076.0,
+      "step": 670
+    },
+    {
+      "entropy": 0.8651713371276856,
+      "epoch": 1.2523020257826887,
+      "grad_norm": 0.7924166321754456,
+      "learning_rate": 9.847158677536034e-05,
+      "loss": 0.0766686737537384,
+      "mean_token_accuracy": 0.9702267110347748,
+      "num_tokens": 13970642.0,
+      "step": 680
+    },
+    {
+      "entropy": 0.8763024985790253,
+      "epoch": 1.270718232044199,
+      "grad_norm": 0.741219162940979,
+      "learning_rate": 9.839596687237403e-05,
+      "loss": 0.07189929485321045,
+      "mean_token_accuracy": 0.9727097094058991,
+      "num_tokens": 14176556.0,
+      "step": 690
+    },
+    {
+      "entropy": 0.8556921362876893,
+      "epoch": 1.289134438305709,
+      "grad_norm": 0.6298198103904724,
+      "learning_rate": 9.831855163686618e-05,
+      "loss": 0.07608137726783752,
+      "mean_token_accuracy": 0.9716399371623993,
+      "num_tokens": 14381686.0,
+      "step": 700
+    },
+    {
+      "entropy": 0.869178420305252,
+      "epoch": 1.3075506445672191,
+      "grad_norm": 0.5850273370742798,
+      "learning_rate": 9.823934394068952e-05,
+      "loss": 0.07437651753425598,
+      "mean_token_accuracy": 0.9709566533565521,
+      "num_tokens": 14586814.0,
+      "step": 710
+    },
+    {
+      "entropy": 0.8708595156669616,
+      "epoch": 1.3259668508287292,
+      "grad_norm": 0.6580632328987122,
+      "learning_rate": 9.815834672219127e-05,
+      "loss": 0.07518917322158813,
+      "mean_token_accuracy": 0.9717426657676697,
+      "num_tokens": 14792321.0,
+      "step": 720
+    },
+    {
+      "entropy": 0.8826817810535431,
+      "epoch": 1.3443830570902393,
+      "grad_norm": 0.8788532018661499,
+      "learning_rate": 9.807556298610404e-05,
+      "loss": 0.07579240798950196,
+      "mean_token_accuracy": 0.9706341981887817,
+      "num_tokens": 14997810.0,
+      "step": 730
+    },
+    {
+      "entropy": 0.9012470185756684,
+      "epoch": 1.3627992633517496,
+      "grad_norm": 0.7022138237953186,
+      "learning_rate": 9.799099580343441e-05,
+      "loss": 0.0775588572025299,
+      "mean_token_accuracy": 0.9699241399765015,
+      "num_tokens": 15203795.0,
+      "step": 740
+    },
+    {
+      "entropy": 0.886955714225769,
+      "epoch": 1.3812154696132597,
+      "grad_norm": 0.7881133556365967,
+      "learning_rate": 9.790464831134903e-05,
+      "loss": 0.07125020027160645,
+      "mean_token_accuracy": 0.9723815560340882,
+      "num_tokens": 15408974.0,
+      "step": 750
+    },
+    {
+      "entropy": 0.9047374844551086,
+      "epoch": 1.3996316758747698,
+      "grad_norm": 0.9082005023956299,
+      "learning_rate": 9.781652371305824e-05,
+      "loss": 0.07004334926605224,
+      "mean_token_accuracy": 0.9725580036640167,
+      "num_tokens": 15614399.0,
+      "step": 760
+    },
+    {
+      "entropy": 0.9039053857326508,
+      "epoch": 1.4180478821362799,
+      "grad_norm": 0.8060817122459412,
+      "learning_rate": 9.77266252776972e-05,
+      "loss": 0.07103485465049744,
+      "mean_token_accuracy": 0.9721468150615692,
+      "num_tokens": 15819895.0,
+      "step": 770
+    },
+    {
+      "entropy": 0.8998047232627868,
+      "epoch": 1.43646408839779,
+      "grad_norm": 1.0152642726898193,
+      "learning_rate": 9.763495634020467e-05,
+      "loss": 0.07411704063415528,
+      "mean_token_accuracy": 0.9711063146591187,
+      "num_tokens": 16025297.0,
+      "step": 780
+    },
+    {
+      "entropy": 0.9120213568210602,
+      "epoch": 1.4548802946593002,
+      "grad_norm": 0.6288319826126099,
+      "learning_rate": 9.754152030119921e-05,
+      "loss": 0.07223712205886841,
+      "mean_token_accuracy": 0.9722476422786712,
+      "num_tokens": 16230656.0,
+      "step": 790
+    },
+    {
+      "entropy": 0.9142370820045471,
+      "epoch": 1.4732965009208103,
+      "grad_norm": 0.7854700088500977,
+      "learning_rate": 9.744632062685311e-05,
+      "loss": 0.07186744809150696,
+      "mean_token_accuracy": 0.972247713804245,
+      "num_tokens": 16435943.0,
+      "step": 800
+    },
+    {
+      "entropy": 0.8920814216136932,
+      "epoch": 1.4917127071823204,
+      "grad_norm": 0.6227074265480042,
+      "learning_rate": 9.734936084876383e-05,
+      "loss": 0.07016961574554444,
+      "mean_token_accuracy": 0.9725603640079499,
+      "num_tokens": 16641635.0,
+      "step": 810
+    },
+    {
+      "entropy": 0.891328877210617,
+      "epoch": 1.5101289134438307,
+      "grad_norm": 0.7601346969604492,
+      "learning_rate": 9.725064456382283e-05,
+      "loss": 0.07137494087219239,
+      "mean_token_accuracy": 0.9722997546195984,
+      "num_tokens": 16847194.0,
+      "step": 820
+    },
+    {
+      "entropy": 0.8921217978000641,
+      "epoch": 1.5285451197053406,
+      "grad_norm": 0.7813850045204163,
+      "learning_rate": 9.715017543408233e-05,
+      "loss": 0.06890199184417725,
+      "mean_token_accuracy": 0.9735044002532959,
+      "num_tokens": 17052807.0,
+      "step": 830
+    },
+    {
+      "entropy": 0.9085914671421051,
+      "epoch": 1.5469613259668509,
+      "grad_norm": 0.6184289455413818,
+      "learning_rate": 9.704795718661939e-05,
+      "loss": 0.07043765187263488,
+      "mean_token_accuracy": 0.9725716531276702,
+      "num_tokens": 17258284.0,
+      "step": 840
+    },
+    {
+      "entropy": 0.9029861629009247,
+      "epoch": 1.565377532228361,
+      "grad_norm": 0.7082377076148987,
+      "learning_rate": 9.694399361339752e-05,
+      "loss": 0.07113839387893676,
+      "mean_token_accuracy": 0.9725669205188752,
+      "num_tokens": 17464326.0,
+      "step": 850
+    },
+    {
+      "entropy": 0.8856533527374267,
+      "epoch": 1.583793738489871,
+      "grad_norm": 0.7409216165542603,
+      "learning_rate": 9.683828857112627e-05,
+      "loss": 0.07077333331108093,
+      "mean_token_accuracy": 0.9731084644794464,
+      "num_tokens": 17669537.0,
+      "step": 860
+    },
+    {
+      "entropy": 0.8613030433654785,
+      "epoch": 1.6022099447513813,
+      "grad_norm": 0.6801561713218689,
+      "learning_rate": 9.673084598111789e-05,
+      "loss": 0.06885308027267456,
+      "mean_token_accuracy": 0.97266526222229,
+      "num_tokens": 17875289.0,
+      "step": 870
+    },
+    {
+      "entropy": 0.8692965865135193,
+      "epoch": 1.6206261510128912,
+      "grad_norm": 1.1621277332305908,
+      "learning_rate": 9.662166982914203e-05,
+      "loss": 0.07017780542373657,
+      "mean_token_accuracy": 0.9733059942722321,
+      "num_tokens": 18080404.0,
+      "step": 880
+    },
+    {
+      "entropy": 0.8671502113342285,
+      "epoch": 1.6390423572744015,
+      "grad_norm": 0.7518903613090515,
+      "learning_rate": 9.651076416527787e-05,
+      "loss": 0.06977018713951111,
+      "mean_token_accuracy": 0.9730017304420471,
+      "num_tokens": 18285699.0,
+      "step": 890
+    },
+    {
+      "entropy": 0.8662045657634735,
+      "epoch": 1.6574585635359116,
+      "grad_norm": 0.6622698903083801,
+      "learning_rate": 9.639813310376378e-05,
+      "loss": 0.06620995998382569,
+      "mean_token_accuracy": 0.9737491130828857,
+      "num_tokens": 18491097.0,
+      "step": 900
+    },
+    {
+      "entropy": 0.8548173069953918,
+      "epoch": 1.6758747697974217,
+      "grad_norm": 0.8941843509674072,
+      "learning_rate": 9.628378082284479e-05,
+      "loss": 0.06711119413375854,
+      "mean_token_accuracy": 0.9740589797496796,
+      "num_tokens": 18696827.0,
+      "step": 910
+    },
+    {
+      "entropy": 0.8763562262058258,
+      "epoch": 1.694290976058932,
+      "grad_norm": 0.7571700215339661,
+      "learning_rate": 9.616771156461755e-05,
+      "loss": 0.07263468503952027,
+      "mean_token_accuracy": 0.9717419981956482,
+      "num_tokens": 18902513.0,
+      "step": 920
+    },
+    {
+      "entropy": 0.8663733780384064,
+      "epoch": 1.7127071823204418,
+      "grad_norm": 0.7886489629745483,
+      "learning_rate": 9.604992963487298e-05,
+      "loss": 0.07074605226516724,
+      "mean_token_accuracy": 0.9724965393543243,
+      "num_tokens": 19107812.0,
+      "step": 930
+    },
+    {
+      "entropy": 0.8673004627227783,
+      "epoch": 1.7311233885819521,
+      "grad_norm": 0.8180726170539856,
+      "learning_rate": 9.593043940293647e-05,
+      "loss": 0.06831735372543335,
+      "mean_token_accuracy": 0.9733696818351746,
+      "num_tokens": 19313330.0,
+      "step": 940
+    },
+    {
+      "entropy": 0.8525971233844757,
+      "epoch": 1.7495395948434622,
+      "grad_norm": 0.6576228737831116,
+      "learning_rate": 9.580924530150595e-05,
+      "loss": 0.06567002534866333,
+      "mean_token_accuracy": 0.9745754361152649,
+      "num_tokens": 19518671.0,
+      "step": 950
+    },
+    {
+      "entropy": 0.8605451703071594,
+      "epoch": 1.7679558011049723,
+      "grad_norm": 0.7171661257743835,
+      "learning_rate": 9.568635182648725e-05,
+      "loss": 0.06872050762176514,
+      "mean_token_accuracy": 0.9732091546058654,
+      "num_tokens": 19724135.0,
+      "step": 960
+    },
+    {
+      "entropy": 0.8642210960388184,
+      "epoch": 1.7863720073664826,
+      "grad_norm": 0.7603147029876709,
+      "learning_rate": 9.556176353682746e-05,
+      "loss": 0.06766576766967773,
+      "mean_token_accuracy": 0.9728681743144989,
+      "num_tokens": 19928785.0,
+      "step": 970
+    },
+    {
+      "entropy": 0.8543185651302337,
+      "epoch": 1.8047882136279927,
+      "grad_norm": 0.7280875444412231,
+      "learning_rate": 9.543548505434581e-05,
+      "loss": 0.06851862668991089,
+      "mean_token_accuracy": 0.9737437188625335,
+      "num_tokens": 20134195.0,
+      "step": 980
+    },
+    {
+      "entropy": 0.8744745373725891,
+      "epoch": 1.8232044198895028,
+      "grad_norm": 0.5897248983383179,
+      "learning_rate": 9.530752106356209e-05,
+      "loss": 0.06809053421020508,
+      "mean_token_accuracy": 0.9733593761920929,
+      "num_tokens": 20339517.0,
+      "step": 990
+    },
+    {
+      "entropy": 0.8623859465122223,
+      "epoch": 1.8416206261510129,
+      "grad_norm": 0.7515265345573425,
+      "learning_rate": 9.517787631152298e-05,
+      "loss": 0.07257847785949707,
+      "mean_token_accuracy": 0.9714054942131043,
+      "num_tokens": 20545249.0,
+      "step": 1000
+    },
+    {
+      "entropy": 0.8669404804706573,
+      "epoch": 1.860036832412523,
+      "grad_norm": 0.7144560813903809,
+      "learning_rate": 9.504655560762596e-05,
+      "loss": 0.06832354068756104,
+      "mean_token_accuracy": 0.9735779523849487,
+      "num_tokens": 20750507.0,
+      "step": 1010
+    },
+    {
+      "entropy": 0.8493516445159912,
+      "epoch": 1.8784530386740332,
+      "grad_norm": 0.6559189558029175,
+      "learning_rate": 9.491356382344081e-05,
+      "loss": 0.0629766047000885,
+      "mean_token_accuracy": 0.9754977762699127,
+      "num_tokens": 20955956.0,
+      "step": 1020
+    },
+    {
+      "entropy": 0.8599376022815705,
+      "epoch": 1.8968692449355433,
+      "grad_norm": 0.6792973279953003,
+      "learning_rate": 9.477890589252895e-05,
+      "loss": 0.0666757881641388,
+      "mean_token_accuracy": 0.974083811044693,
+      "num_tokens": 21161163.0,
+      "step": 1030
+    },
+    {
+      "entropy": 0.8458438158035279,
+      "epoch": 1.9152854511970534,
+      "grad_norm": 0.6941778659820557,
+      "learning_rate": 9.464258681026042e-05,
+      "loss": 0.06307152509689332,
+      "mean_token_accuracy": 0.9757042229175568,
+      "num_tokens": 21366525.0,
+      "step": 1040
+    },
+    {
+      "entropy": 0.848515909910202,
+      "epoch": 1.9337016574585635,
+      "grad_norm": 0.7307806611061096,
+      "learning_rate": 9.450461163362855e-05,
+      "loss": 0.06307026147842407,
+      "mean_token_accuracy": 0.9750974595546722,
+      "num_tokens": 21572238.0,
+      "step": 1050
+    },
+    {
+      "entropy": 0.8563454031944275,
+      "epoch": 1.9521178637200736,
+      "grad_norm": 0.7222106456756592,
+      "learning_rate": 9.436498548106236e-05,
+      "loss": 0.0647726058959961,
+      "mean_token_accuracy": 0.974629694223404,
+      "num_tokens": 21777633.0,
+      "step": 1060
+    },
+    {
+      "entropy": 0.8656457483768463,
+      "epoch": 1.9705340699815839,
+      "grad_norm": 0.67178875207901,
+      "learning_rate": 9.422371353223674e-05,
+      "loss": 0.06573554277420043,
+      "mean_token_accuracy": 0.9745908617973328,
+      "num_tokens": 21983116.0,
+      "step": 1070
+    },
+    {
+      "entropy": 0.8630891263484954,
+      "epoch": 1.988950276243094,
+      "grad_norm": 0.6956593990325928,
+      "learning_rate": 9.408080102788016e-05,
+      "loss": 0.06630704402923585,
+      "mean_token_accuracy": 0.9741333484649658,
+      "num_tokens": 22188662.0,
+      "step": 1080
+    },
+    {
+      "epoch": 2.0,
+      "eval_entropy": 0.8560857042022373,
+      "eval_loss": 0.06494329869747162,
+      "eval_mean_token_accuracy": 0.9745692672936813,
+      "eval_num_tokens": 22311800.0,
+      "eval_runtime": 10.129,
+      "eval_samples_per_second": 361.142,
+      "eval_steps_per_second": 11.354,
+      "step": 1086
+    },
+    {
+      "entropy": 0.8616272270679474,
+      "epoch": 2.007366482504604,
+      "grad_norm": 0.7778105139732361,
+      "learning_rate": 9.393625326958041e-05,
+      "loss": 0.054407155513763426,
+      "mean_token_accuracy": 0.9792074799537659,
+      "num_tokens": 22394215.0,
+      "step": 1090
+    },
+    {
+      "entropy": 0.8496910452842712,
+      "epoch": 2.0257826887661143,
+      "grad_norm": 0.7422528266906738,
+      "learning_rate": 9.379007561958792e-05,
+      "loss": 0.051881587505340575,
+      "mean_token_accuracy": 0.9799090325832367,
+      "num_tokens": 22599599.0,
+      "step": 1100
+    },
+    {
+      "entropy": 0.8531602442264556,
+      "epoch": 2.044198895027624,
+      "grad_norm": 0.9075332880020142,
+      "learning_rate": 9.36422735006167e-05,
+      "loss": 0.05190724730491638,
+      "mean_token_accuracy": 0.979931116104126,
+      "num_tokens": 22805318.0,
+      "step": 1110
+    },
+    {
+      "entropy": 0.8657277703285218,
+      "epoch": 2.0626151012891345,
+      "grad_norm": 0.9466913938522339,
+      "learning_rate": 9.349285239564325e-05,
+      "loss": 0.053853434324264524,
+      "mean_token_accuracy": 0.9796103596687317,
+      "num_tokens": 23010438.0,
+      "step": 1120
+    },
+    {
+      "entropy": 0.8578485429286957,
+      "epoch": 2.0810313075506444,
+      "grad_norm": 0.6903054714202881,
+      "learning_rate": 9.334181784770326e-05,
+      "loss": 0.05228850841522217,
+      "mean_token_accuracy": 0.9802409887313843,
+      "num_tokens": 23215795.0,
+      "step": 1130
+    },
+    {
+      "entropy": 0.8450767934322357,
+      "epoch": 2.0994475138121547,
+      "grad_norm": 0.6615211367607117,
+      "learning_rate": 9.318917545968581e-05,
+      "loss": 0.050570905208587646,
+      "mean_token_accuracy": 0.9802053451538086,
+      "num_tokens": 23421157.0,
+      "step": 1140
+    },
+    {
+      "entropy": 0.8325044393539429,
+      "epoch": 2.117863720073665,
+      "grad_norm": 0.760960578918457,
+      "learning_rate": 9.303493089412564e-05,
+      "loss": 0.051966112852096555,
+      "mean_token_accuracy": 0.9796205997467041,
+      "num_tokens": 23626584.0,
+      "step": 1150
+    },
+    {
+      "entropy": 0.8416404843330383,
+      "epoch": 2.136279926335175,
+      "grad_norm": 0.6947009563446045,
+      "learning_rate": 9.287908987299306e-05,
+      "loss": 0.05144861936569214,
+      "mean_token_accuracy": 0.9800034642219544,
+      "num_tokens": 23832137.0,
+      "step": 1160
+    },
+    {
+      "entropy": 0.8564540028572083,
+      "epoch": 2.154696132596685,
+      "grad_norm": 0.733252763748169,
+      "learning_rate": 9.272165817748164e-05,
+      "loss": 0.04944799542427063,
+      "mean_token_accuracy": 0.9808157980442047,
+      "num_tokens": 24038006.0,
+      "step": 1170
+    },
+    {
+      "entropy": 0.8575525343418121,
+      "epoch": 2.1731123388581954,
+      "grad_norm": 0.8911028504371643,
+      "learning_rate": 9.25626416477938e-05,
+      "loss": 0.05037952661514282,
+      "mean_token_accuracy": 0.980946284532547,
+      "num_tokens": 24243374.0,
+      "step": 1180
+    },
+    {
+      "entropy": 0.8599720418453216,
+      "epoch": 2.1915285451197053,
+      "grad_norm": 0.7713524103164673,
+      "learning_rate": 9.240204618292416e-05,
+      "loss": 0.050603735446929934,
+      "mean_token_accuracy": 0.980896121263504,
+      "num_tokens": 24448585.0,
+      "step": 1190
+    },
+    {
+      "entropy": 0.8566664934158326,
+      "epoch": 2.2099447513812156,
+      "grad_norm": 0.8439353704452515,
+      "learning_rate": 9.223987774044066e-05,
+      "loss": 0.054171699285507205,
+      "mean_token_accuracy": 0.9796543836593627,
+      "num_tokens": 24653863.0,
+      "step": 1200
+    },
+    {
+      "entropy": 0.846601277589798,
+      "epoch": 2.2283609576427255,
+      "grad_norm": 0.7025637030601501,
+      "learning_rate": 9.207614233626356e-05,
+      "loss": 0.048924127221107484,
+      "mean_token_accuracy": 0.9809681415557862,
+      "num_tokens": 24859801.0,
+      "step": 1210
+    },
+    {
+      "entropy": 0.8564423739910125,
+      "epoch": 2.2467771639042358,
+      "grad_norm": 0.7788274884223938,
+      "learning_rate": 9.191084604444233e-05,
+      "loss": 0.05260283350944519,
+      "mean_token_accuracy": 0.9793797850608825,
+      "num_tokens": 25065368.0,
+      "step": 1220
+    },
+    {
+      "entropy": 0.865056723356247,
+      "epoch": 2.265193370165746,
+      "grad_norm": 0.8728818297386169,
+      "learning_rate": 9.174399499693027e-05,
+      "loss": 0.05016371011734009,
+      "mean_token_accuracy": 0.9807134211063385,
+      "num_tokens": 25270945.0,
+      "step": 1230
+    },
+    {
+      "entropy": 0.8642262935638427,
+      "epoch": 2.283609576427256,
+      "grad_norm": 1.0582489967346191,
+      "learning_rate": 9.157559538335703e-05,
+      "loss": 0.05316779017448425,
+      "mean_token_accuracy": 0.9794209063053131,
+      "num_tokens": 25476575.0,
+      "step": 1240
+    },
+    {
+      "entropy": 0.8677761554718018,
+      "epoch": 2.3020257826887662,
+      "grad_norm": 0.760109543800354,
+      "learning_rate": 9.140565345079901e-05,
+      "loss": 0.05115479230880737,
+      "mean_token_accuracy": 0.9802310705184937,
+      "num_tokens": 25682814.0,
+      "step": 1250
+    },
+    {
+      "entropy": 0.8592945456504821,
+      "epoch": 2.320441988950276,
+      "grad_norm": 0.6537907123565674,
+      "learning_rate": 9.123417550354761e-05,
+      "loss": 0.050543540716171266,
+      "mean_token_accuracy": 0.9806945025920868,
+      "num_tokens": 25887575.0,
+      "step": 1260
+    },
+    {
+      "entropy": 0.8692500293254852,
+      "epoch": 2.3388581952117864,
+      "grad_norm": 0.7771905064582825,
+      "learning_rate": 9.106116790287541e-05,
+      "loss": 0.049718713760375975,
+      "mean_token_accuracy": 0.9805168390274048,
+      "num_tokens": 26092950.0,
+      "step": 1270
+    },
+    {
+      "entropy": 0.8841261565685272,
+      "epoch": 2.3572744014732967,
+      "grad_norm": 0.7791076898574829,
+      "learning_rate": 9.08866370668001e-05,
+      "loss": 0.0527400553226471,
+      "mean_token_accuracy": 0.9796754539012908,
+      "num_tokens": 26298182.0,
+      "step": 1280
+    },
+    {
+      "entropy": 0.8675022900104523,
+      "epoch": 2.3756906077348066,
+      "grad_norm": 0.8481605648994446,
+      "learning_rate": 9.07105894698464e-05,
+      "loss": 0.05320838689804077,
+      "mean_token_accuracy": 0.9792274832725525,
+      "num_tokens": 26503425.0,
+      "step": 1290
+    },
+    {
+      "entropy": 0.8704026222229004,
+      "epoch": 2.394106813996317,
+      "grad_norm": 0.8235505819320679,
+      "learning_rate": 9.053303164280602e-05,
+      "loss": 0.055045205354690555,
+      "mean_token_accuracy": 0.9788750648498535,
+      "num_tokens": 26708755.0,
+      "step": 1300
+    },
+    {
+      "entropy": 0.8525134027004242,
+      "epoch": 2.4125230202578267,
+      "grad_norm": 0.7611598968505859,
+      "learning_rate": 9.035397017249518e-05,
+      "loss": 0.05029621124267578,
+      "mean_token_accuracy": 0.9802757322788238,
+      "num_tokens": 26914704.0,
+      "step": 1310
+    },
+    {
+      "entropy": 0.8630305290222168,
+      "epoch": 2.430939226519337,
+      "grad_norm": 0.790408194065094,
+      "learning_rate": 9.017341170151041e-05,
+      "loss": 0.04856040775775909,
+      "mean_token_accuracy": 0.9809690833091735,
+      "num_tokens": 27120151.0,
+      "step": 1320
+    },
+    {
+      "entropy": 0.8579159140586853,
+      "epoch": 2.4493554327808473,
+      "grad_norm": 0.781972348690033,
+      "learning_rate": 8.999136292798207e-05,
+      "loss": 0.04869682788848877,
+      "mean_token_accuracy": 0.9816130697727203,
+      "num_tokens": 27325673.0,
+      "step": 1330
+    },
+    {
+      "entropy": 0.8634716987609863,
+      "epoch": 2.467771639042357,
+      "grad_norm": 0.8500784039497375,
+      "learning_rate": 8.980783060532588e-05,
+      "loss": 0.05050289034843445,
+      "mean_token_accuracy": 0.980079609155655,
+      "num_tokens": 27531270.0,
+      "step": 1340
+    },
+    {
+      "entropy": 0.8660618126392364,
+      "epoch": 2.4861878453038675,
+      "grad_norm": 0.719760537147522,
+      "learning_rate": 8.96228215419924e-05,
+      "loss": 0.04892141819000244,
+      "mean_token_accuracy": 0.9814020991325378,
+      "num_tokens": 27736542.0,
+      "step": 1350
+    },
+    {
+      "entropy": 0.8572284400463104,
+      "epoch": 2.5046040515653774,
+      "grad_norm": 1.0197229385375977,
+      "learning_rate": 8.943634260121442e-05,
+      "loss": 0.05104702711105347,
+      "mean_token_accuracy": 0.9798846662044525,
+      "num_tokens": 27941566.0,
+      "step": 1360
+    },
+    {
+      "entropy": 0.8702241241931915,
+      "epoch": 2.5230202578268877,
+      "grad_norm": 0.7136003375053406,
+      "learning_rate": 8.924840070075247e-05,
+      "loss": 0.04855787754058838,
+      "mean_token_accuracy": 0.9811685383319855,
+      "num_tokens": 28146943.0,
+      "step": 1370
+    },
+    {
+      "entropy": 0.874957013130188,
+      "epoch": 2.541436464088398,
+      "grad_norm": 0.8775497674942017,
+      "learning_rate": 8.905900281263804e-05,
+      "loss": 0.052434295415878296,
+      "mean_token_accuracy": 0.9795438170433044,
+      "num_tokens": 28352640.0,
+      "step": 1380
+    },
+    {
+      "entropy": 0.8776536166667939,
+      "epoch": 2.559852670349908,
+      "grad_norm": 0.8895741105079651,
+      "learning_rate": 8.8868155962915e-05,
+      "loss": 0.05282890796661377,
+      "mean_token_accuracy": 0.9790538609027862,
+      "num_tokens": 28558153.0,
+      "step": 1390
+    },
+    {
+      "entropy": 0.8738743245601654,
+      "epoch": 2.578268876611418,
+      "grad_norm": 0.788800060749054,
+      "learning_rate": 8.867586723137906e-05,
+      "loss": 0.048841872811317445,
+      "mean_token_accuracy": 0.9809149026870727,
+      "num_tokens": 28763613.0,
+      "step": 1400
+    },
+    {
+      "entropy": 0.8750253796577454,
+      "epoch": 2.596685082872928,
+      "grad_norm": 0.8738002777099609,
+      "learning_rate": 8.848214375131497e-05,
+      "loss": 0.048261132836341855,
+      "mean_token_accuracy": 0.980789190530777,
+      "num_tokens": 28969248.0,
+      "step": 1410
+    },
+    {
+      "entropy": 0.8624245524406433,
+      "epoch": 2.6151012891344383,
+      "grad_norm": 0.6404895186424255,
+      "learning_rate": 8.828699270923196e-05,
+      "loss": 0.04970468282699585,
+      "mean_token_accuracy": 0.9807762265205383,
+      "num_tokens": 29174779.0,
+      "step": 1420
+    },
+    {
+      "entropy": 0.8792938470840455,
+      "epoch": 2.6335174953959486,
+      "grad_norm": 0.7856965661048889,
+      "learning_rate": 8.80904213445972e-05,
+      "loss": 0.053334391117095946,
+      "mean_token_accuracy": 0.9790222108364105,
+      "num_tokens": 29380474.0,
+      "step": 1430
+    },
+    {
+      "entropy": 0.8831034600734711,
+      "epoch": 2.6519337016574585,
+      "grad_norm": 0.7739618420600891,
+      "learning_rate": 8.789243694956716e-05,
+      "loss": 0.04959054589271546,
+      "mean_token_accuracy": 0.9803965091705322,
+      "num_tokens": 29585985.0,
+      "step": 1440
+    },
+    {
+      "entropy": 0.8934672951698304,
+      "epoch": 2.6703499079189688,
+      "grad_norm": 0.6999697089195251,
+      "learning_rate": 8.769304686871719e-05,
+      "loss": 0.05165250301361084,
+      "mean_token_accuracy": 0.9798884153366089,
+      "num_tokens": 29791238.0,
+      "step": 1450
+    },
+    {
+      "entropy": 0.9053199410438537,
+      "epoch": 2.6887661141804786,
+      "grad_norm": 0.9199564456939697,
+      "learning_rate": 8.749225849876892e-05,
+      "loss": 0.04924143850803375,
+      "mean_token_accuracy": 0.9810785710811615,
+      "num_tokens": 29996589.0,
+      "step": 1460
+    },
+    {
+      "entropy": 0.888091403245926,
+      "epoch": 2.707182320441989,
+      "grad_norm": 0.7480106353759766,
+      "learning_rate": 8.729007928831597e-05,
+      "loss": 0.04948916733264923,
+      "mean_token_accuracy": 0.9809579730033875,
+      "num_tokens": 30201875.0,
+      "step": 1470
+    },
+    {
+      "entropy": 0.8723407983779907,
+      "epoch": 2.7255985267034992,
+      "grad_norm": 0.9506945013999939,
+      "learning_rate": 8.708651673754763e-05,
+      "loss": 0.048927539587020875,
+      "mean_token_accuracy": 0.980553150177002,
+      "num_tokens": 30407550.0,
+      "step": 1480
+    },
+    {
+      "entropy": 0.8737521529197693,
+      "epoch": 2.744014732965009,
+      "grad_norm": 0.8015706539154053,
+      "learning_rate": 8.688157839797062e-05,
+      "loss": 0.04963063597679138,
+      "mean_token_accuracy": 0.9809738755226135,
+      "num_tokens": 30612839.0,
+      "step": 1490
+    },
+    {
+      "entropy": 0.8800762951374054,
+      "epoch": 2.7624309392265194,
+      "grad_norm": 0.9429986476898193,
+      "learning_rate": 8.667527187212885e-05,
+      "loss": 0.0524174690246582,
+      "mean_token_accuracy": 0.9788767337799072,
+      "num_tokens": 30818578.0,
+      "step": 1500
+    },
+    {
+      "entropy": 0.8871055901050567,
+      "epoch": 2.7808471454880292,
+      "grad_norm": 0.5909196138381958,
+      "learning_rate": 8.646760481332157e-05,
+      "loss": 0.05166680812835693,
+      "mean_token_accuracy": 0.980216771364212,
+      "num_tokens": 31023829.0,
+      "step": 1510
+    },
+    {
+      "entropy": 0.8908755779266357,
+      "epoch": 2.7992633517495396,
+      "grad_norm": 0.9154611229896545,
+      "learning_rate": 8.625858492531931e-05,
+      "loss": 0.04951836466789246,
+      "mean_token_accuracy": 0.9801484227180481,
+      "num_tokens": 31229635.0,
+      "step": 1520
+    },
+    {
+      "entropy": 0.92480548620224,
+      "epoch": 2.81767955801105,
+      "grad_norm": 0.5989938378334045,
+      "learning_rate": 8.604821996207819e-05,
+      "loss": 0.04799881279468536,
+      "mean_token_accuracy": 0.9817522585391998,
+      "num_tokens": 31435456.0,
+      "step": 1530
+    },
+    {
+      "entropy": 0.9173881888389588,
+      "epoch": 2.8360957642725597,
+      "grad_norm": 0.899413526058197,
+      "learning_rate": 8.58365177274522e-05,
+      "loss": 0.0487445592880249,
+      "mean_token_accuracy": 0.9812625288963318,
+      "num_tokens": 31640904.0,
+      "step": 1540
+    },
+    {
+      "entropy": 0.9076135993003845,
+      "epoch": 2.85451197053407,
+      "grad_norm": 0.8494166135787964,
+      "learning_rate": 8.562348607490376e-05,
+      "loss": 0.05005228519439697,
+      "mean_token_accuracy": 0.9806681036949157,
+      "num_tokens": 31845807.0,
+      "step": 1550
+    },
+    {
+      "entropy": 0.9092245221138,
+      "epoch": 2.87292817679558,
+      "grad_norm": 0.8225123286247253,
+      "learning_rate": 8.540913290721234e-05,
+      "loss": 0.048654764890670776,
+      "mean_token_accuracy": 0.9805659353733063,
+      "num_tokens": 32051523.0,
+      "step": 1560
+    },
+    {
+      "entropy": 0.9062779664993286,
+      "epoch": 2.89134438305709,
+      "grad_norm": 0.7074014544487,
+      "learning_rate": 8.519346617618134e-05,
+      "loss": 0.049209845066070554,
+      "mean_token_accuracy": 0.9807434439659118,
+      "num_tokens": 32256895.0,
+      "step": 1570
+    },
+    {
+      "entropy": 0.9190246641635895,
+      "epoch": 2.9097605893186005,
+      "grad_norm": 0.8860642910003662,
+      "learning_rate": 8.497649388234304e-05,
+      "loss": 0.051211881637573245,
+      "mean_token_accuracy": 0.9802342295646668,
+      "num_tokens": 32462031.0,
+      "step": 1580
+    },
+    {
+      "entropy": 0.9088015079498291,
+      "epoch": 2.9281767955801103,
+      "grad_norm": 0.8062726855278015,
+      "learning_rate": 8.475822407466188e-05,
+      "loss": 0.053512704372406,
+      "mean_token_accuracy": 0.979486483335495,
+      "num_tokens": 32667533.0,
+      "step": 1590
+    },
+    {
+      "entropy": 0.9462027847766876,
+      "epoch": 2.9465930018416207,
+      "grad_norm": 0.7962909936904907,
+      "learning_rate": 8.453866485023579e-05,
+      "loss": 0.0501457154750824,
+      "mean_token_accuracy": 0.9803222417831421,
+      "num_tokens": 32872900.0,
+      "step": 1600
+    },
+    {
+      "entropy": 0.9671471297740937,
+      "epoch": 2.9650092081031305,
+      "grad_norm": 0.7641744017601013,
+      "learning_rate": 8.431782435399587e-05,
+      "loss": 0.04629061222076416,
+      "mean_token_accuracy": 0.9823175370693207,
+      "num_tokens": 33077850.0,
+      "step": 1610
+    },
+    {
+      "entropy": 0.955865204334259,
+      "epoch": 2.983425414364641,
+      "grad_norm": 0.6772348880767822,
+      "learning_rate": 8.409571077840426e-05,
+      "loss": 0.048368623852729796,
+      "mean_token_accuracy": 0.9808700799942016,
+      "num_tokens": 33283117.0,
+      "step": 1620
+    },
+    {
+      "epoch": 3.0,
+      "eval_entropy": 0.9563225186389426,
+      "eval_loss": 0.059064481407403946,
+      "eval_mean_token_accuracy": 0.9773589429648026,
+      "eval_num_tokens": 33467712.0,
+      "eval_runtime": 10.1471,
+      "eval_samples_per_second": 360.499,
+      "eval_steps_per_second": 11.333,
+      "step": 1629
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 5430,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 10,
+  "save_steps": 500,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 1.595677368674943e+18,
+  "train_batch_size": 32,
+  "trial_name": null,
+  "trial_params": null
+}
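
The state file above interleaves training log entries (which carry a `loss` key) with per-epoch evaluation entries (which carry `eval_loss`). The following is a minimal sketch of how one might separate the two and recover the steps-per-epoch figure. The `state` dict is a hand-copied excerpt of values from the log, not the full file, and the helper names `steps_per_epoch` and `split_logs` are hypothetical, not part of Transformers.

```python
# Excerpt of values copied from the trainer state shown above (not the full file).
state = {
    "max_steps": 5430,
    "num_train_epochs": 10,
    "log_history": [
        {"step": 1080, "epoch": 1.988950276243094, "loss": 0.06630704402923585},
        {"step": 1086, "epoch": 2.0, "eval_loss": 0.06494329869747162},
        {"step": 1620, "epoch": 2.983425414364641, "loss": 0.048368623852729796},
        {"step": 1629, "epoch": 3.0, "eval_loss": 0.059064481407403946},
    ],
}


def steps_per_epoch(state):
    """Optimizer steps per epoch, assuming max_steps is spread evenly over all epochs."""
    return state["max_steps"] // state["num_train_epochs"]


def split_logs(state):
    """Separate training entries (key 'loss') from evaluation entries (key 'eval_loss')."""
    train = [e for e in state["log_history"] if "loss" in e]
    evals = [e for e in state["log_history"] if "eval_loss" in e]
    return train, evals


per_epoch = steps_per_epoch(state)
train_logs, eval_logs = split_logs(state)
for entry in eval_logs:
    # Print the completed epoch number next to its evaluation loss.
    print(entry["step"] // per_epoch, entry["eval_loss"])
```

With 543 steps per epoch, checkpoint-1629 falls at the end of epoch 3 and checkpoint-2172 at the end of epoch 4, which matches the `epoch` values logged at steps 1086 and 1629.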

checkpoint-1629/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
+size 5777

checkpoint-2172/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-2172/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
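
The adapter configuration above sets `r` to 8 and `lora_alpha` to 32 across seven projection modules. As a rough sketch, the resulting LoRA scaling factor and trainable parameter count can be estimated as follows. The layer shapes (`HIDDEN`, `INTERMEDIATE`, `KV_DIM`, `NUM_LAYERS`) are assumptions about Qwen2.5-7B that do not appear in this diff and should be checked against the base model's own `config.json`; `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen2.5-7B shapes (NOT stated in this diff; verify against the base config).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512        # assumed num_key_value_heads * head_dim = 4 * 128
NUM_LAYERS = 28

# Values taken from adapter_config.json above.
R = 8
LORA_ALPHA = 32

# (in_features, out_features) for each module listed in target_modules.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, r, num_layers):
    """Each wrapped linear layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


# With use_rslora false, the low-rank update is scaled by lora_alpha / r.
scaling = LORA_ALPHA / R
params = lora_param_count(SHAPES, R, NUM_LAYERS)
print(scaling, params, params * 4)  # params * 4 = bytes if stored as float32
```

Under these assumed shapes the estimate is 20,185,088 trainable parameters, or 80,740,352 bytes at four bytes per weight. That is close to the 80,792,096-byte size recorded for `adapter_model.safetensors`, which would be consistent with float32 storage plus a file header, though the diff itself does not state the dtype.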

checkpoint-2172/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3c86f8f8223f27673f137496bfe71dc599b6baf7e185de17ad979b78a2ac98e6
+size 80792096
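
The three lines above are a Git LFS pointer rather than the weights themselves: the repository stores only the spec version, a content hash, and the byte size, while the binary lives in LFS storage. A small sketch of reading such a pointer is shown below; `parse_lfs_pointer` is a hypothetical helper written for illustration, not a function from any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the diff above, without the leading '+' markers.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3c86f8f8223f27673f137496bfe71dc599b6baf7e185de17ad979b78a2ac98e6
size 80792096
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

After downloading the real file, its SHA-256 digest and byte length can be compared against `digest` and `size` to confirm the download is intact.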

checkpoint-2172/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
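
To make the template's output concrete, here is a minimal pure-Python sketch of its non-tool path only: the system-prompt fallback, the per-message `<|im_start|>` / `<|im_end|>` wrapping, and the optional generation prompt. The function name `render_chatml` is hypothetical, and the sketch deliberately ignores `tools`, `tool_calls`, and `tool` messages; in practice the Jinja file itself is what gets rendered.

```python
# Simplified re-implementation of the non-tool branch of chat_template.jinja.
# ASSUMPTION: no tools, no tool_calls, no "tool" role messages.
DEFAULT_SYSTEM = (
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
)


def render_chatml(messages, add_generation_prompt=False):
    """Render messages the way the template does when `tools` is unset."""
    parts = []
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        parts.append(
            "<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n"
        )
        rest = messages[1:]
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
        rest = messages
    for message in rest:
        # User, later system, and plain assistant turns share one format.
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leaves the assistant turn open for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True,
)
print(prompt)
```

Because no system message is supplied in this usage example, the rendered prompt starts with the default Qwen system prompt, followed by the user turn and an open assistant turn.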

checkpoint-2172/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

checkpoint-2172/tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

checkpoint-2172/trainer_state.json ADDED

	@@ -0,0 +1,2248 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 4.0,
+  "eval_steps": 500,
+  "global_step": 2172,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "entropy": 1.2237394809722901,
+      "epoch": 0.01841620626151013,
+      "grad_norm": 5.082435607910156,
+      "learning_rate": 3.308823529411765e-06,
+      "loss": 0.9237876892089844,
+      "mean_token_accuracy": 0.7685343027114868,
+      "num_tokens": 205423.0,
+      "step": 10
+    },
+    {
+      "entropy": 1.2295925617218018,
+      "epoch": 0.03683241252302026,
+      "grad_norm": 4.672000408172607,
+      "learning_rate": 6.985294117647059e-06,
+      "loss": 0.8900892257690429,
+      "mean_token_accuracy": 0.7677771031856537,
+      "num_tokens": 410849.0,
+      "step": 20
+    },
+    {
+      "entropy": 1.2285718679428101,
+      "epoch": 0.055248618784530384,
+      "grad_norm": 1.4828118085861206,
+      "learning_rate": 1.0661764705882354e-05,
+      "loss": 0.5975452899932862,
+      "mean_token_accuracy": 0.8146551787853241,
+      "num_tokens": 616438.0,
+      "step": 30
+    },
+    {
+      "entropy": 1.210776400566101,
+      "epoch": 0.07366482504604052,
+      "grad_norm": 0.7761328816413879,
+      "learning_rate": 1.4338235294117647e-05,
+      "loss": 0.40664992332458494,
+      "mean_token_accuracy": 0.8699092030525207,
+      "num_tokens": 822118.0,
+      "step": 40
+    },
+    {
+      "entropy": 1.200321125984192,
+      "epoch": 0.09208103130755065,
+      "grad_norm": 0.5363371968269348,
+      "learning_rate": 1.8014705882352943e-05,
+      "loss": 0.3313469409942627,
+      "mean_token_accuracy": 0.8904915869235992,
+      "num_tokens": 1027941.0,
+      "step": 50
+    },
+    {
+      "entropy": 1.1809936046600342,
+      "epoch": 0.11049723756906077,
+      "grad_norm": 0.39541518688201904,
+      "learning_rate": 2.1691176470588237e-05,
+      "loss": 0.27568228244781495,
+      "mean_token_accuracy": 0.9047131836414337,
+      "num_tokens": 1233620.0,
+      "step": 60
+    },
+    {
+      "entropy": 1.169810914993286,
+      "epoch": 0.1289134438305709,
+      "grad_norm": 0.341960072517395,
+      "learning_rate": 2.536764705882353e-05,
+      "loss": 0.245219087600708,
+      "mean_token_accuracy": 0.9150686681270599,
+      "num_tokens": 1438656.0,
+      "step": 70
+    },
+    {
+      "entropy": 1.1652960777282715,
+      "epoch": 0.14732965009208104,
+      "grad_norm": 0.36872178316116333,
+      "learning_rate": 2.9044117647058828e-05,
+      "loss": 0.2220149040222168,
+      "mean_token_accuracy": 0.9224777698516846,
+      "num_tokens": 1643877.0,
+      "step": 80
+    },
+    {
+      "entropy": 1.154341197013855,
+      "epoch": 0.16574585635359115,
+      "grad_norm": 0.4152425229549408,
+      "learning_rate": 3.272058823529412e-05,
+      "loss": 0.2002798557281494,
+      "mean_token_accuracy": 0.9285802960395813,
+      "num_tokens": 1849506.0,
+      "step": 90
+    },
+    {
+      "entropy": 1.1507258892059327,
+      "epoch": 0.1841620626151013,
+      "grad_norm": 0.47647765278816223,
+      "learning_rate": 3.639705882352941e-05,
+      "loss": 0.18871363401412963,
+      "mean_token_accuracy": 0.9318056285381318,
+      "num_tokens": 2055071.0,
+      "step": 100
+    },
+    {
+      "entropy": 1.1455535531044005,
+      "epoch": 0.20257826887661143,
+      "grad_norm": 0.4853009581565857,
+      "learning_rate": 4.007352941176471e-05,
+      "loss": 0.17836341857910157,
+      "mean_token_accuracy": 0.9367631554603577,
+      "num_tokens": 2260643.0,
+      "step": 110
+    },
+    {
+      "entropy": 1.1402526497840881,
+      "epoch": 0.22099447513812154,
+      "grad_norm": 0.4455392360687256,
+      "learning_rate": 4.375e-05,
+      "loss": 0.16921783685684205,
+      "mean_token_accuracy": 0.9386959195137023,
+      "num_tokens": 2466085.0,
+      "step": 120
+    },
+    {
+      "entropy": 1.1374777555465698,
+      "epoch": 0.23941068139963168,
+      "grad_norm": 0.5880279541015625,
+      "learning_rate": 4.742647058823529e-05,
+      "loss": 0.15989291667938232,
+      "mean_token_accuracy": 0.9421182632446289,
+      "num_tokens": 2671024.0,
+      "step": 130
+    },
+    {
+      "entropy": 1.1273940205574036,
+      "epoch": 0.2578268876611418,
+      "grad_norm": 0.612959086894989,
+      "learning_rate": 5.110294117647059e-05,
+      "loss": 0.14701461791992188,
+      "mean_token_accuracy": 0.9463540315628052,
+      "num_tokens": 2876848.0,
+      "step": 140
+    },
+    {
+      "entropy": 1.1263513088226318,
+      "epoch": 0.27624309392265195,
+      "grad_norm": 0.5695255398750305,
+      "learning_rate": 5.477941176470589e-05,
+      "loss": 0.14604382514953612,
+      "mean_token_accuracy": 0.946351945400238,
+      "num_tokens": 3082589.0,
+      "step": 150
+    },
+    {
+      "entropy": 1.1290789365768432,
+      "epoch": 0.2946593001841621,
+      "grad_norm": 0.6608090996742249,
+      "learning_rate": 5.845588235294118e-05,
+      "loss": 0.1409450054168701,
+      "mean_token_accuracy": 0.9481450319290161,
+      "num_tokens": 3287459.0,
+      "step": 160
+    },
+    {
+      "entropy": 1.1291529774665832,
+      "epoch": 0.31307550644567217,
+      "grad_norm": 0.652715802192688,
+      "learning_rate": 6.213235294117647e-05,
+      "loss": 0.14441155195236205,
+      "mean_token_accuracy": 0.9466125547885895,
+      "num_tokens": 3493682.0,
+      "step": 170
+    },
+    {
+      "entropy": 1.1244838953018188,
+      "epoch": 0.3314917127071823,
+      "grad_norm": 0.7815241813659668,
+      "learning_rate": 6.580882352941177e-05,
+      "loss": 0.13361064195632935,
+      "mean_token_accuracy": 0.9512295544147491,
+      "num_tokens": 3699573.0,
+      "step": 180
+    },
+    {
+      "entropy": 1.1217721104621887,
+      "epoch": 0.34990791896869244,
+      "grad_norm": 0.7933160066604614,
+      "learning_rate": 6.948529411764706e-05,
+      "loss": 0.13089522123336791,
+      "mean_token_accuracy": 0.9520221531391144,
+      "num_tokens": 3905156.0,
+      "step": 190
+    },
+    {
+      "entropy": 1.1206679105758668,
+      "epoch": 0.3683241252302026,
+      "grad_norm": 0.6815240383148193,
+      "learning_rate": 7.316176470588236e-05,
+      "loss": 0.13400404453277587,
+      "mean_token_accuracy": 0.9501322209835052,
+      "num_tokens": 4110570.0,
+      "step": 200
+    },
+    {
+      "entropy": 1.1161052227020263,
+      "epoch": 0.3867403314917127,
+      "grad_norm": 0.8297767639160156,
+      "learning_rate": 7.683823529411766e-05,
+      "loss": 0.13389937877655028,
+      "mean_token_accuracy": 0.9501932203769684,
+      "num_tokens": 4315834.0,
+      "step": 210
+    },
+    {
+      "entropy": 1.1098745942115784,
+      "epoch": 0.40515653775322286,
+      "grad_norm": 0.5943381786346436,
+      "learning_rate": 8.051470588235294e-05,
+      "loss": 0.13452907800674438,
+      "mean_token_accuracy": 0.9503286242485046,
+      "num_tokens": 4520807.0,
+      "step": 220
+    },
+    {
+      "entropy": 1.100480353832245,
+      "epoch": 0.42357274401473294,
+      "grad_norm": 0.6094359755516052,
+      "learning_rate": 8.419117647058824e-05,
+      "loss": 0.12827746868133544,
+      "mean_token_accuracy": 0.952492094039917,
+      "num_tokens": 4725867.0,
+      "step": 230
+    },
+    {
+      "entropy": 1.0901286959648133,
+      "epoch": 0.4419889502762431,
+      "grad_norm": 0.7240597605705261,
+      "learning_rate": 8.786764705882353e-05,
+      "loss": 0.12171242237091065,
+      "mean_token_accuracy": 0.953943532705307,
+      "num_tokens": 4931629.0,
+      "step": 240
+    },
+    {
+      "entropy": 1.0885071873664856,
+      "epoch": 0.4604051565377532,
+      "grad_norm": 0.6939547657966614,
+      "learning_rate": 9.154411764705882e-05,
+      "loss": 0.12155698537826538,
+      "mean_token_accuracy": 0.9545870959758759,
+      "num_tokens": 5137285.0,
+      "step": 250
+    },
+    {
+      "entropy": 1.086272156238556,
+      "epoch": 0.47882136279926335,
+      "grad_norm": 0.5752800703048706,
+      "learning_rate": 9.522058823529412e-05,
+      "loss": 0.12157790660858155,
+      "mean_token_accuracy": 0.9541126549243927,
+      "num_tokens": 5342575.0,
+      "step": 260
+    },
+    {
+      "entropy": 1.0857678413391114,
+      "epoch": 0.4972375690607735,
+      "grad_norm": 0.7565123438835144,
+      "learning_rate": 9.889705882352942e-05,
+      "loss": 0.12349612712860107,
+      "mean_token_accuracy": 0.9535140514373779,
+      "num_tokens": 5547995.0,
+      "step": 270
+    },
+    {
+      "entropy": 1.079762625694275,
+      "epoch": 0.5156537753222836,
+      "grad_norm": 0.6972768306732178,
+      "learning_rate": 9.999954556423843e-05,
+      "loss": 0.11875582933425903,
+      "mean_token_accuracy": 0.9556483089923858,
+      "num_tokens": 5753195.0,
+      "step": 280
+    },
+    {
+      "entropy": 1.0742079138755798,
+      "epoch": 0.5340699815837937,
+      "grad_norm": 0.7821696996688843,
+      "learning_rate": 9.999731977631227e-05,
+      "loss": 0.11824090480804443,
+      "mean_token_accuracy": 0.9557521045207977,
+      "num_tokens": 5958236.0,
+      "step": 290
+    },
+    {
+      "entropy": 1.0679773569107056,
+      "epoch": 0.5524861878453039,
+      "grad_norm": 0.5846888422966003,
+      "learning_rate": 9.999323925089486e-05,
+      "loss": 0.11707355976104736,
+      "mean_token_accuracy": 0.9554719448089599,
+      "num_tokens": 6163992.0,
+      "step": 300
+    },
+    {
+      "entropy": 1.0655727863311768,
+      "epoch": 0.570902394106814,
+      "grad_norm": 0.5812502503395081,
+      "learning_rate": 9.998730413936037e-05,
+      "loss": 0.11371417045593261,
+      "mean_token_accuracy": 0.9576376020908356,
+      "num_tokens": 6369456.0,
+      "step": 310
+    },
+    {
+      "entropy": 1.0607039332389832,
+      "epoch": 0.5893186003683242,
+      "grad_norm": 0.6238475441932678,
+      "learning_rate": 9.99795146618821e-05,
+      "loss": 0.11775733232498169,
+      "mean_token_accuracy": 0.9557221591472626,
+      "num_tokens": 6574833.0,
+      "step": 320
+    },
+    {
+      "entropy": 1.0504255175590516,
+      "epoch": 0.6077348066298343,
+      "grad_norm": 0.6496815085411072,
+      "learning_rate": 9.996987110742422e-05,
+      "loss": 0.10904088020324706,
+      "mean_token_accuracy": 0.9585366368293762,
+      "num_tokens": 6780108.0,
+      "step": 330
+    },
+    {
+      "entropy": 1.0456081986427308,
+      "epoch": 0.6261510128913443,
+      "grad_norm": 0.786702573299408,
+      "learning_rate": 9.995837383373119e-05,
+      "loss": 0.10642309188842773,
+      "mean_token_accuracy": 0.9596696078777314,
+      "num_tokens": 6985920.0,
+      "step": 340
+    },
+    {
+      "entropy": 1.0455098271369934,
+      "epoch": 0.6445672191528545,
+      "grad_norm": 0.5473790168762207,
+      "learning_rate": 9.994502326731434e-05,
+      "loss": 0.10822961330413819,
+      "mean_token_accuracy": 0.959563136100769,
+      "num_tokens": 7191465.0,
+      "step": 350
+    },
+    {
+      "entropy": 1.04240562915802,
+      "epoch": 0.6629834254143646,
+      "grad_norm": 0.6672356128692627,
+      "learning_rate": 9.992981990343614e-05,
+      "loss": 0.1110004186630249,
+      "mean_token_accuracy": 0.9582514643669129,
+      "num_tokens": 7396877.0,
+      "step": 360
+    },
+    {
+      "entropy": 1.0386811256408692,
+      "epoch": 0.6813996316758748,
+      "grad_norm": 0.698539674282074,
+      "learning_rate": 9.99127643060918e-05,
+      "loss": 0.107539963722229,
+      "mean_token_accuracy": 0.9593036234378814,
+      "num_tokens": 7602437.0,
+      "step": 370
+    },
+    {
+      "entropy": 1.0311225533485413,
+      "epoch": 0.6998158379373849,
+      "grad_norm": 0.6629284024238586,
+      "learning_rate": 9.989385710798837e-05,
+      "loss": 0.1064023494720459,
+      "mean_token_accuracy": 0.9602205216884613,
+      "num_tokens": 7808142.0,
+      "step": 380
+    },
+    {
+      "entropy": 1.030210506916046,
+      "epoch": 0.7182320441988951,
+      "grad_norm": 0.5616748929023743,
+      "learning_rate": 9.987309901052121e-05,
+      "loss": 0.10717041492462158,
+      "mean_token_accuracy": 0.9599347949028015,
+      "num_tokens": 8013407.0,
+      "step": 390
+    },
+    {
+      "entropy": 1.0208017826080322,
+      "epoch": 0.7366482504604052,
+      "grad_norm": 0.6329049468040466,
+      "learning_rate": 9.985049078374806e-05,
+      "loss": 0.10359601974487305,
+      "mean_token_accuracy": 0.9603756129741668,
+      "num_tokens": 8219040.0,
+      "step": 400
+    },
+    {
+      "entropy": 1.015640377998352,
+      "epoch": 0.7550644567219152,
+      "grad_norm": 0.6516013741493225,
+      "learning_rate": 9.982603326636037e-05,
+      "loss": 0.10146439075469971,
+      "mean_token_accuracy": 0.9627702474594116,
+      "num_tokens": 8424678.0,
+      "step": 410
+    },
+    {
+      "entropy": 1.0105359435081482,
+      "epoch": 0.7734806629834254,
+      "grad_norm": 0.6920603513717651,
+      "learning_rate": 9.979972736565226e-05,
+      "loss": 0.10770498514175415,
+      "mean_token_accuracy": 0.9591470420360565,
+      "num_tokens": 8629868.0,
+      "step": 420
+    },
+    {
+      "entropy": 0.9966452836990356,
+      "epoch": 0.7918968692449355,
+      "grad_norm": 0.6857476234436035,
+      "learning_rate": 9.977157405748687e-05,
+      "loss": 0.10282524824142455,
+      "mean_token_accuracy": 0.9612209022045135,
+      "num_tokens": 8835320.0,
+      "step": 430
+    },
+    {
+      "entropy": 0.9945534646511078,
+      "epoch": 0.8103130755064457,
+      "grad_norm": 0.7208472490310669,
+      "learning_rate": 9.974157438626008e-05,
+      "loss": 0.10069938898086547,
+      "mean_token_accuracy": 0.9620070576667785,
+      "num_tokens": 9041123.0,
+      "step": 440
+    },
+    {
+      "entropy": 0.979461395740509,
+      "epoch": 0.8287292817679558,
+      "grad_norm": 0.5071915984153748,
+      "learning_rate": 9.970972946486185e-05,
+      "loss": 0.09799174070358277,
+      "mean_token_accuracy": 0.9620374023914338,
+      "num_tokens": 9246361.0,
+      "step": 450
+    },
+    {
+      "entropy": 0.9830998003482818,
+      "epoch": 0.8471454880294659,
+      "grad_norm": 0.8660802245140076,
+      "learning_rate": 9.967604047463493e-05,
+      "loss": 0.10378165245056152,
+      "mean_token_accuracy": 0.9606865763664245,
+      "num_tokens": 9451845.0,
+      "step": 460
+    },
+    {
+      "entropy": 0.9813413023948669,
+      "epoch": 0.8655616942909761,
+      "grad_norm": 0.7642477750778198,
+      "learning_rate": 9.964050866533094e-05,
+      "loss": 0.1010061264038086,
+      "mean_token_accuracy": 0.9608745336532593,
+      "num_tokens": 9656802.0,
+      "step": 470
+    },
+    {
+      "entropy": 0.967874163389206,
+      "epoch": 0.8839779005524862,
+      "grad_norm": 0.5987281799316406,
+      "learning_rate": 9.960313535506411e-05,
+      "loss": 0.10169394016265869,
+      "mean_token_accuracy": 0.9611998200416565,
+      "num_tokens": 9861719.0,
+      "step": 480
+    },
+    {
+      "entropy": 0.9663491308689117,
+      "epoch": 0.9023941068139963,
+      "grad_norm": 0.6124638319015503,
+      "learning_rate": 9.956392193026239e-05,
+      "loss": 0.102389657497406,
+      "mean_token_accuracy": 0.9611884355545044,
+      "num_tokens": 10066673.0,
+      "step": 490
+    },
+    {
+      "entropy": 0.959654438495636,
+      "epoch": 0.9208103130755064,
+      "grad_norm": 0.7873051762580872,
+      "learning_rate": 9.952286984561592e-05,
+      "loss": 0.10170392990112305,
+      "mean_token_accuracy": 0.9610928475856781,
+      "num_tokens": 10272091.0,
+      "step": 500
+    },
+    {
+      "entropy": 0.9550537407398224,
+      "epoch": 0.9392265193370166,
+      "grad_norm": 0.6071968078613281,
+      "learning_rate": 9.947998062402313e-05,
+      "loss": 0.09448277950286865,
+      "mean_token_accuracy": 0.9648977637290954,
+      "num_tokens": 10477632.0,
+      "step": 510
+    },
+    {
+      "entropy": 0.9538533687591553,
+      "epoch": 0.9576427255985267,
+      "grad_norm": 0.6317242980003357,
+      "learning_rate": 9.943525585653428e-05,
+      "loss": 0.09542192220687866,
+      "mean_token_accuracy": 0.9635261118412017,
+      "num_tokens": 10682828.0,
+      "step": 520
+    },
+    {
+      "entropy": 0.9362513542175293,
+      "epoch": 0.9760589318600368,
+      "grad_norm": 0.6421944499015808,
+      "learning_rate": 9.938869720229234e-05,
+      "loss": 0.09382058382034301,
+      "mean_token_accuracy": 0.9648073971271515,
+      "num_tokens": 10888741.0,
+      "step": 530
+    },
+    {
+      "entropy": 0.9235438346862793,
+      "epoch": 0.994475138121547,
+      "grad_norm": 0.7986873388290405,
+      "learning_rate": 9.934030638847155e-05,
+      "loss": 0.09827429056167603,
+      "mean_token_accuracy": 0.9621128737926483,
+      "num_tokens": 11094387.0,
+      "step": 540
+    },
+    {
+      "epoch": 1.0,
+      "eval_entropy": 0.9137652366057686,
+      "eval_loss": 0.09368764609098434,
+      "eval_mean_token_accuracy": 0.9640816880309063,
+      "eval_num_tokens": 11155908.0,
+      "eval_runtime": 10.4701,
+      "eval_samples_per_second": 349.377,
+      "eval_steps_per_second": 10.984,
+      "step": 543
+    },
+    {
+      "entropy": 0.9047818422317505,
+      "epoch": 1.0128913443830572,
+      "grad_norm": 0.6781501173973083,
+      "learning_rate": 9.929008521021325e-05,
+      "loss": 0.0863916516304016,
+      "mean_token_accuracy": 0.9673655688762665,
+      "num_tokens": 11299715.0,
+      "step": 550
+    },
+    {
+      "entropy": 0.8856981039047241,
+      "epoch": 1.0313075506445673,
+      "grad_norm": 0.7143136858940125,
+      "learning_rate": 9.923803553055937e-05,
+      "loss": 0.08632323145866394,
+      "mean_token_accuracy": 0.9677783191204071,
+      "num_tokens": 11505059.0,
+      "step": 560
+    },
+    {
+      "entropy": 0.8937099635601043,
+      "epoch": 1.0497237569060773,
+      "grad_norm": 0.7751694321632385,
+      "learning_rate": 9.918415928038325e-05,
+      "loss": 0.08178263902664185,
+      "mean_token_accuracy": 0.9694291114807129,
+      "num_tokens": 11710464.0,
+      "step": 570
+    },
+    {
+      "entropy": 0.8858704209327698,
+      "epoch": 1.0681399631675874,
+      "grad_norm": 0.7492292523384094,
+      "learning_rate": 9.912845845831805e-05,
+      "loss": 0.08074211478233337,
+      "mean_token_accuracy": 0.9692470014095307,
+      "num_tokens": 11915959.0,
+      "step": 580
+    },
+    {
+      "entropy": 0.8948039829730987,
+      "epoch": 1.0865561694290977,
+      "grad_norm": 0.8116479516029358,
+      "learning_rate": 9.907093513068259e-05,
+      "loss": 0.08712012171745301,
+      "mean_token_accuracy": 0.9669980227947235,
+      "num_tokens": 12121499.0,
+      "step": 590
+    },
+    {
+      "entropy": 0.8846789538860321,
+      "epoch": 1.1049723756906078,
+      "grad_norm": 0.7295626997947693,
+      "learning_rate": 9.901159143140471e-05,
+      "loss": 0.08444435596466064,
+      "mean_token_accuracy": 0.9674544095993042,
+      "num_tokens": 12327061.0,
+      "step": 600
+    },
+    {
+      "entropy": 0.8734103918075562,
+      "epoch": 1.1233885819521179,
+      "grad_norm": 0.9585768580436707,
+      "learning_rate": 9.89504295619421e-05,
+      "loss": 0.08022565841674804,
+      "mean_token_accuracy": 0.969569206237793,
+      "num_tokens": 12532305.0,
+      "step": 610
+    },
+    {
+      "entropy": 0.8640486001968384,
+      "epoch": 1.141804788213628,
+      "grad_norm": 0.7891159057617188,
+      "learning_rate": 9.88874517912006e-05,
+      "loss": 0.08415375947952271,
+      "mean_token_accuracy": 0.9678892493247986,
+      "num_tokens": 12737828.0,
+      "step": 620
+    },
+    {
+      "entropy": 0.8599755525588989,
+      "epoch": 1.160220994475138,
+      "grad_norm": 0.5801345109939575,
+      "learning_rate": 9.882266045545012e-05,
+      "loss": 0.08100489974021911,
+      "mean_token_accuracy": 0.9688023269176483,
+      "num_tokens": 12943343.0,
+      "step": 630
+    },
+    {
+      "entropy": 0.86524977684021,
+      "epoch": 1.1786372007366483,
+      "grad_norm": 0.7633041143417358,
+      "learning_rate": 9.87560579582379e-05,
+      "loss": 0.07859406471252442,
+      "mean_token_accuracy": 0.9702189445495606,
+      "num_tokens": 13148473.0,
+      "step": 640
+    },
+    {
+      "entropy": 0.8466695249080658,
+      "epoch": 1.1970534069981584,
+      "grad_norm": 0.8672215938568115,
+      "learning_rate": 9.868764677029934e-05,
+      "loss": 0.08082623481750488,
+      "mean_token_accuracy": 0.9689972400665283,
+      "num_tokens": 13353890.0,
+      "step": 650
+    },
+    {
+      "entropy": 0.8596941530704498,
+      "epoch": 1.2154696132596685,
+      "grad_norm": 0.7524124383926392,
+      "learning_rate": 9.861742942946639e-05,
+      "loss": 0.0789935290813446,
+      "mean_token_accuracy": 0.9693858206272126,
+      "num_tokens": 13559475.0,
+      "step": 660
+    },
+    {
+      "entropy": 0.8708749234676361,
+      "epoch": 1.2338858195211786,
+      "grad_norm": 0.5777031183242798,
+      "learning_rate": 9.854540854057337e-05,
+      "loss": 0.07773642539978028,
+      "mean_token_accuracy": 0.970385092496872,
+      "num_tokens": 13765076.0,
+      "step": 670
+    },
+    {
+      "entropy": 0.8651713371276856,
+      "epoch": 1.2523020257826887,
+      "grad_norm": 0.7924166321754456,
+      "learning_rate": 9.847158677536034e-05,
+      "loss": 0.0766686737537384,
+      "mean_token_accuracy": 0.9702267110347748,
+      "num_tokens": 13970642.0,
+      "step": 680
+    },
+    {
+      "entropy": 0.8763024985790253,
+      "epoch": 1.270718232044199,
+      "grad_norm": 0.741219162940979,
+      "learning_rate": 9.839596687237403e-05,
+      "loss": 0.07189929485321045,
+      "mean_token_accuracy": 0.9727097094058991,
+      "num_tokens": 14176556.0,
+      "step": 690
+    },
+    {
+      "entropy": 0.8556921362876893,
+      "epoch": 1.289134438305709,
+      "grad_norm": 0.6298198103904724,
+      "learning_rate": 9.831855163686618e-05,
+      "loss": 0.07608137726783752,
+      "mean_token_accuracy": 0.9716399371623993,
+      "num_tokens": 14381686.0,
+      "step": 700
+    },
+    {
+      "entropy": 0.869178420305252,
+      "epoch": 1.3075506445672191,
+      "grad_norm": 0.5850273370742798,
+      "learning_rate": 9.823934394068952e-05,
+      "loss": 0.07437651753425598,
+      "mean_token_accuracy": 0.9709566533565521,
+      "num_tokens": 14586814.0,
+      "step": 710
+    },
+    {
+      "entropy": 0.8708595156669616,
+      "epoch": 1.3259668508287292,
+      "grad_norm": 0.6580632328987122,
+      "learning_rate": 9.815834672219127e-05,
+      "loss": 0.07518917322158813,
+      "mean_token_accuracy": 0.9717426657676697,
+      "num_tokens": 14792321.0,
+      "step": 720
+    },
+    {
+      "entropy": 0.8826817810535431,
+      "epoch": 1.3443830570902393,
+      "grad_norm": 0.8788532018661499,
+      "learning_rate": 9.807556298610404e-05,
+      "loss": 0.07579240798950196,
+      "mean_token_accuracy": 0.9706341981887817,
+      "num_tokens": 14997810.0,
+      "step": 730
+    },
+    {
+      "entropy": 0.9012470185756684,
+      "epoch": 1.3627992633517496,
+      "grad_norm": 0.7022138237953186,
+      "learning_rate": 9.799099580343441e-05,
+      "loss": 0.0775588572025299,
+      "mean_token_accuracy": 0.9699241399765015,
+      "num_tokens": 15203795.0,
+      "step": 740
+    },
+    {
+      "entropy": 0.886955714225769,
+      "epoch": 1.3812154696132597,
+      "grad_norm": 0.7881133556365967,
+      "learning_rate": 9.790464831134903e-05,
+      "loss": 0.07125020027160645,
+      "mean_token_accuracy": 0.9723815560340882,
+      "num_tokens": 15408974.0,
+      "step": 750
+    },
+    {
+      "entropy": 0.9047374844551086,
+      "epoch": 1.3996316758747698,
+      "grad_norm": 0.9082005023956299,
+      "learning_rate": 9.781652371305824e-05,
+      "loss": 0.07004334926605224,
+      "mean_token_accuracy": 0.9725580036640167,
+      "num_tokens": 15614399.0,
+      "step": 760
+    },
+    {
+      "entropy": 0.9039053857326508,
+      "epoch": 1.4180478821362799,
+      "grad_norm": 0.8060817122459412,
+      "learning_rate": 9.77266252776972e-05,
+      "loss": 0.07103485465049744,
+      "mean_token_accuracy": 0.9721468150615692,
+      "num_tokens": 15819895.0,
+      "step": 770
+    },
+    {
+      "entropy": 0.8998047232627868,
+      "epoch": 1.43646408839779,
+      "grad_norm": 1.0152642726898193,
+      "learning_rate": 9.763495634020467e-05,
+      "loss": 0.07411704063415528,
+      "mean_token_accuracy": 0.9711063146591187,
+      "num_tokens": 16025297.0,
+      "step": 780
+    },
+    {
+      "entropy": 0.9120213568210602,
+      "epoch": 1.4548802946593002,
+      "grad_norm": 0.6288319826126099,
+      "learning_rate": 9.754152030119921e-05,
+      "loss": 0.07223712205886841,
+      "mean_token_accuracy": 0.9722476422786712,
+      "num_tokens": 16230656.0,
+      "step": 790
+    },
+    {
+      "entropy": 0.9142370820045471,
+      "epoch": 1.4732965009208103,
+      "grad_norm": 0.7854700088500977,
+      "learning_rate": 9.744632062685311e-05,
+      "loss": 0.07186744809150696,
+      "mean_token_accuracy": 0.972247713804245,
+      "num_tokens": 16435943.0,
+      "step": 800
+    },
+    {
+      "entropy": 0.8920814216136932,
+      "epoch": 1.4917127071823204,
+      "grad_norm": 0.6227074265480042,
+      "learning_rate": 9.734936084876383e-05,
+      "loss": 0.07016961574554444,
+      "mean_token_accuracy": 0.9725603640079499,
+      "num_tokens": 16641635.0,
+      "step": 810
+    },
+    {
+      "entropy": 0.891328877210617,
+      "epoch": 1.5101289134438307,
+      "grad_norm": 0.7601346969604492,
+      "learning_rate": 9.725064456382283e-05,
+      "loss": 0.07137494087219239,
+      "mean_token_accuracy": 0.9722997546195984,
+      "num_tokens": 16847194.0,
+      "step": 820
+    },
+    {
+      "entropy": 0.8921217978000641,
+      "epoch": 1.5285451197053406,
+      "grad_norm": 0.7813850045204163,
+      "learning_rate": 9.715017543408233e-05,
+      "loss": 0.06890199184417725,
+      "mean_token_accuracy": 0.9735044002532959,
+      "num_tokens": 17052807.0,
+      "step": 830
+    },
+    {
+      "entropy": 0.9085914671421051,
+      "epoch": 1.5469613259668509,
+      "grad_norm": 0.6184289455413818,
+      "learning_rate": 9.704795718661939e-05,
+      "loss": 0.07043765187263488,
+      "mean_token_accuracy": 0.9725716531276702,
+      "num_tokens": 17258284.0,
+      "step": 840
+    },
+    {
+      "entropy": 0.9029861629009247,
+      "epoch": 1.565377532228361,
+      "grad_norm": 0.7082377076148987,
+      "learning_rate": 9.694399361339752e-05,
+      "loss": 0.07113839387893676,
+      "mean_token_accuracy": 0.9725669205188752,
+      "num_tokens": 17464326.0,
+      "step": 850
+    },
+    {
+      "entropy": 0.8856533527374267,
+      "epoch": 1.583793738489871,
+      "grad_norm": 0.7409216165542603,
+      "learning_rate": 9.683828857112627e-05,
+      "loss": 0.07077333331108093,
+      "mean_token_accuracy": 0.9731084644794464,
+      "num_tokens": 17669537.0,
+      "step": 860
+    },
+    {
+      "entropy": 0.8613030433654785,
+      "epoch": 1.6022099447513813,
+      "grad_norm": 0.6801561713218689,
+      "learning_rate": 9.673084598111789e-05,
+      "loss": 0.06885308027267456,
+      "mean_token_accuracy": 0.97266526222229,
+      "num_tokens": 17875289.0,
+      "step": 870
+    },
+    {
+      "entropy": 0.8692965865135193,
+      "epoch": 1.6206261510128912,
+      "grad_norm": 1.1621277332305908,
+      "learning_rate": 9.662166982914203e-05,
+      "loss": 0.07017780542373657,
+      "mean_token_accuracy": 0.9733059942722321,
+      "num_tokens": 18080404.0,
+      "step": 880
+    },
+    {
+      "entropy": 0.8671502113342285,
+      "epoch": 1.6390423572744015,
+      "grad_norm": 0.7518903613090515,
+      "learning_rate": 9.651076416527787e-05,
+      "loss": 0.06977018713951111,
+      "mean_token_accuracy": 0.9730017304420471,
+      "num_tokens": 18285699.0,
+      "step": 890
+    },
+    {
+      "entropy": 0.8662045657634735,
+      "epoch": 1.6574585635359116,
+      "grad_norm": 0.6622698903083801,
+      "learning_rate": 9.639813310376378e-05,
+      "loss": 0.06620995998382569,
+      "mean_token_accuracy": 0.9737491130828857,
+      "num_tokens": 18491097.0,
+      "step": 900
+    },
+    {
+      "entropy": 0.8548173069953918,
+      "epoch": 1.6758747697974217,
+      "grad_norm": 0.8941843509674072,
+      "learning_rate": 9.628378082284479e-05,
+      "loss": 0.06711119413375854,
+      "mean_token_accuracy": 0.9740589797496796,
+      "num_tokens": 18696827.0,
+      "step": 910
+    },
+    {
+      "entropy": 0.8763562262058258,
+      "epoch": 1.694290976058932,
+      "grad_norm": 0.7571700215339661,
+      "learning_rate": 9.616771156461755e-05,
+      "loss": 0.07263468503952027,
+      "mean_token_accuracy": 0.9717419981956482,
+      "num_tokens": 18902513.0,
+      "step": 920
+    },
+    {
+      "entropy": 0.8663733780384064,
+      "epoch": 1.7127071823204418,
+      "grad_norm": 0.7886489629745483,
+      "learning_rate": 9.604992963487298e-05,
+      "loss": 0.07074605226516724,
+      "mean_token_accuracy": 0.9724965393543243,
+      "num_tokens": 19107812.0,
+      "step": 930
+    },
+    {
+      "entropy": 0.8673004627227783,
+      "epoch": 1.7311233885819521,
+      "grad_norm": 0.8180726170539856,
+      "learning_rate": 9.593043940293647e-05,
+      "loss": 0.06831735372543335,
+      "mean_token_accuracy": 0.9733696818351746,
+      "num_tokens": 19313330.0,
+      "step": 940
+    },
+    {
+      "entropy": 0.8525971233844757,
+      "epoch": 1.7495395948434622,
+      "grad_norm": 0.6576228737831116,
+      "learning_rate": 9.580924530150595e-05,
+      "loss": 0.06567002534866333,
+      "mean_token_accuracy": 0.9745754361152649,
+      "num_tokens": 19518671.0,
+      "step": 950
+    },
+    {
+      "entropy": 0.8605451703071594,
+      "epoch": 1.7679558011049723,
+      "grad_norm": 0.7171661257743835,
+      "learning_rate": 9.568635182648725e-05,
+      "loss": 0.06872050762176514,
+      "mean_token_accuracy": 0.9732091546058654,
+      "num_tokens": 19724135.0,
+      "step": 960
+    },
+    {
+      "entropy": 0.8642210960388184,
+      "epoch": 1.7863720073664826,
+      "grad_norm": 0.7603147029876709,
+      "learning_rate": 9.556176353682746e-05,
+      "loss": 0.06766576766967773,
+      "mean_token_accuracy": 0.9728681743144989,
+      "num_tokens": 19928785.0,
+      "step": 970
+    },
+    {
+      "entropy": 0.8543185651302337,
+      "epoch": 1.8047882136279927,
+      "grad_norm": 0.7280875444412231,
+      "learning_rate": 9.543548505434581e-05,
+      "loss": 0.06851862668991089,
+      "mean_token_accuracy": 0.9737437188625335,
+      "num_tokens": 20134195.0,
+      "step": 980
+    },
+    {
+      "entropy": 0.8744745373725891,
+      "epoch": 1.8232044198895028,
+      "grad_norm": 0.5897248983383179,
+      "learning_rate": 9.530752106356209e-05,
+      "loss": 0.06809053421020508,
+      "mean_token_accuracy": 0.9733593761920929,
+      "num_tokens": 20339517.0,
+      "step": 990
+    },
+    {
+      "entropy": 0.8623859465122223,
+      "epoch": 1.8416206261510129,
+      "grad_norm": 0.7515265345573425,
+      "learning_rate": 9.517787631152298e-05,
+      "loss": 0.07257847785949707,
+      "mean_token_accuracy": 0.9714054942131043,
+      "num_tokens": 20545249.0,
+      "step": 1000
+    },
+    {
+      "entropy": 0.8669404804706573,
+      "epoch": 1.860036832412523,
+      "grad_norm": 0.7144560813903809,
+      "learning_rate": 9.504655560762596e-05,
+      "loss": 0.06832354068756104,
+      "mean_token_accuracy": 0.9735779523849487,
+      "num_tokens": 20750507.0,
+      "step": 1010
+    },
+    {
+      "entropy": 0.8493516445159912,
+      "epoch": 1.8784530386740332,
+      "grad_norm": 0.6559189558029175,
+      "learning_rate": 9.491356382344081e-05,
+      "loss": 0.0629766047000885,
+      "mean_token_accuracy": 0.9754977762699127,
+      "num_tokens": 20955956.0,
+      "step": 1020
+    },
+    {
+      "entropy": 0.8599376022815705,
+      "epoch": 1.8968692449355433,
+      "grad_norm": 0.6792973279953003,
+      "learning_rate": 9.477890589252895e-05,
+      "loss": 0.0666757881641388,
+      "mean_token_accuracy": 0.974083811044693,
+      "num_tokens": 21161163.0,
+      "step": 1030
+    },
+    {
+      "entropy": 0.8458438158035279,
+      "epoch": 1.9152854511970534,
+      "grad_norm": 0.6941778659820557,
+      "learning_rate": 9.464258681026042e-05,
+      "loss": 0.06307152509689332,
+      "mean_token_accuracy": 0.9757042229175568,
+      "num_tokens": 21366525.0,
+      "step": 1040
+    },
+    {
+      "entropy": 0.848515909910202,
+      "epoch": 1.9337016574585635,
+      "grad_norm": 0.7307806611061096,
+      "learning_rate": 9.450461163362855e-05,
+      "loss": 0.06307026147842407,
+      "mean_token_accuracy": 0.9750974595546722,
+      "num_tokens": 21572238.0,
+      "step": 1050
+    },
+    {
+      "entropy": 0.8563454031944275,
+      "epoch": 1.9521178637200736,
+      "grad_norm": 0.7222106456756592,
+      "learning_rate": 9.436498548106236e-05,
+      "loss": 0.0647726058959961,
+      "mean_token_accuracy": 0.974629694223404,
+      "num_tokens": 21777633.0,
+      "step": 1060
+    },
+    {
+      "entropy": 0.8656457483768463,
+      "epoch": 1.9705340699815839,
+      "grad_norm": 0.67178875207901,
+      "learning_rate": 9.422371353223674e-05,
+      "loss": 0.06573554277420043,
+      "mean_token_accuracy": 0.9745908617973328,
+      "num_tokens": 21983116.0,
+      "step": 1070
+    },
+    {
+      "entropy": 0.8630891263484954,
+      "epoch": 1.988950276243094,
+      "grad_norm": 0.6956593990325928,
+      "learning_rate": 9.408080102788016e-05,
+      "loss": 0.06630704402923585,
+      "mean_token_accuracy": 0.9741333484649658,
+      "num_tokens": 22188662.0,
+      "step": 1080
+    },
+    {
+      "epoch": 2.0,
+      "eval_entropy": 0.8560857042022373,
+      "eval_loss": 0.06494329869747162,
+      "eval_mean_token_accuracy": 0.9745692672936813,
+      "eval_num_tokens": 22311800.0,
+      "eval_runtime": 10.129,
+      "eval_samples_per_second": 361.142,
+      "eval_steps_per_second": 11.354,
+      "step": 1086
+    },
+    {
+      "entropy": 0.8616272270679474,
+      "epoch": 2.007366482504604,
+      "grad_norm": 0.7778105139732361,
+      "learning_rate": 9.393625326958041e-05,
+      "loss": 0.054407155513763426,
+      "mean_token_accuracy": 0.9792074799537659,
+      "num_tokens": 22394215.0,
+      "step": 1090
+    },
+    {
+      "entropy": 0.8496910452842712,
+      "epoch": 2.0257826887661143,
+      "grad_norm": 0.7422528266906738,
+      "learning_rate": 9.379007561958792e-05,
+      "loss": 0.051881587505340575,
+      "mean_token_accuracy": 0.9799090325832367,
+      "num_tokens": 22599599.0,
+      "step": 1100
+    },
+    {
+      "entropy": 0.8531602442264556,
+      "epoch": 2.044198895027624,
+      "grad_norm": 0.9075332880020142,
+      "learning_rate": 9.36422735006167e-05,
+      "loss": 0.05190724730491638,
+      "mean_token_accuracy": 0.979931116104126,
+      "num_tokens": 22805318.0,
+      "step": 1110
+    },
+    {
+      "entropy": 0.8657277703285218,
+      "epoch": 2.0626151012891345,
+      "grad_norm": 0.9466913938522339,
+      "learning_rate": 9.349285239564325e-05,
+      "loss": 0.053853434324264524,
+      "mean_token_accuracy": 0.9796103596687317,
+      "num_tokens": 23010438.0,
+      "step": 1120
+    },
+    {
+      "entropy": 0.8578485429286957,
+      "epoch": 2.0810313075506444,
+      "grad_norm": 0.6903054714202881,
+      "learning_rate": 9.334181784770326e-05,
+      "loss": 0.05228850841522217,
+      "mean_token_accuracy": 0.9802409887313843,
+      "num_tokens": 23215795.0,
+      "step": 1130
+    },
+    {
+      "entropy": 0.8450767934322357,
+      "epoch": 2.0994475138121547,
+      "grad_norm": 0.6615211367607117,
+      "learning_rate": 9.318917545968581e-05,
+      "loss": 0.050570905208587646,
+      "mean_token_accuracy": 0.9802053451538086,
+      "num_tokens": 23421157.0,
+      "step": 1140
+    },
+    {
+      "entropy": 0.8325044393539429,
+      "epoch": 2.117863720073665,
+      "grad_norm": 0.760960578918457,
+      "learning_rate": 9.303493089412564e-05,
+      "loss": 0.051966112852096555,
+      "mean_token_accuracy": 0.9796205997467041,
+      "num_tokens": 23626584.0,
+      "step": 1150
+    },
+    {
+      "entropy": 0.8416404843330383,
+      "epoch": 2.136279926335175,
+      "grad_norm": 0.6947009563446045,
+      "learning_rate": 9.287908987299306e-05,
+      "loss": 0.05144861936569214,
+      "mean_token_accuracy": 0.9800034642219544,
+      "num_tokens": 23832137.0,
+      "step": 1160
+    },
+    {
+      "entropy": 0.8564540028572083,
+      "epoch": 2.154696132596685,
+      "grad_norm": 0.733252763748169,
+      "learning_rate": 9.272165817748164e-05,
+      "loss": 0.04944799542427063,
+      "mean_token_accuracy": 0.9808157980442047,
+      "num_tokens": 24038006.0,
+      "step": 1170
+    },
+    {
+      "entropy": 0.8575525343418121,
+      "epoch": 2.1731123388581954,
+      "grad_norm": 0.8911028504371643,
+      "learning_rate": 9.25626416477938e-05,
+      "loss": 0.05037952661514282,
+      "mean_token_accuracy": 0.980946284532547,
+      "num_tokens": 24243374.0,
+      "step": 1180
+    },
+    {
+      "entropy": 0.8599720418453216,
+      "epoch": 2.1915285451197053,
+      "grad_norm": 0.7713524103164673,
+      "learning_rate": 9.240204618292416e-05,
+      "loss": 0.050603735446929934,
+      "mean_token_accuracy": 0.980896121263504,
+      "num_tokens": 24448585.0,
+      "step": 1190
+    },
+    {
+      "entropy": 0.8566664934158326,
+      "epoch": 2.2099447513812156,
+      "grad_norm": 0.8439353704452515,
+      "learning_rate": 9.223987774044066e-05,
+      "loss": 0.054171699285507205,
+      "mean_token_accuracy": 0.9796543836593627,
+      "num_tokens": 24653863.0,
+      "step": 1200
+    },
+    {
+      "entropy": 0.846601277589798,
+      "epoch": 2.2283609576427255,
+      "grad_norm": 0.7025637030601501,
+      "learning_rate": 9.207614233626356e-05,
+      "loss": 0.048924127221107484,
+      "mean_token_accuracy": 0.9809681415557862,
+      "num_tokens": 24859801.0,
+      "step": 1210
+    },
+    {
+      "entropy": 0.8564423739910125,
+      "epoch": 2.2467771639042358,
+      "grad_norm": 0.7788274884223938,
+      "learning_rate": 9.191084604444233e-05,
+      "loss": 0.05260283350944519,
+      "mean_token_accuracy": 0.9793797850608825,
+      "num_tokens": 25065368.0,
+      "step": 1220
+    },
+    {
+      "entropy": 0.865056723356247,
+      "epoch": 2.265193370165746,
+      "grad_norm": 0.8728818297386169,
+      "learning_rate": 9.174399499693027e-05,
+      "loss": 0.05016371011734009,
+      "mean_token_accuracy": 0.9807134211063385,
+      "num_tokens": 25270945.0,
+      "step": 1230
+    },
+    {
+      "entropy": 0.8642262935638427,
+      "epoch": 2.283609576427256,
+      "grad_norm": 1.0582489967346191,
+      "learning_rate": 9.157559538335703e-05,
+      "loss": 0.05316779017448425,
+      "mean_token_accuracy": 0.9794209063053131,
+      "num_tokens": 25476575.0,
+      "step": 1240
+    },
+    {
+      "entropy": 0.8677761554718018,
+      "epoch": 2.3020257826887662,
+      "grad_norm": 0.760109543800354,
+      "learning_rate": 9.140565345079901e-05,
+      "loss": 0.05115479230880737,
+      "mean_token_accuracy": 0.9802310705184937,
+      "num_tokens": 25682814.0,
+      "step": 1250
+    },
+    {
+      "entropy": 0.8592945456504821,
+      "epoch": 2.320441988950276,
+      "grad_norm": 0.6537907123565674,
+      "learning_rate": 9.123417550354761e-05,
+      "loss": 0.050543540716171266,
+      "mean_token_accuracy": 0.9806945025920868,
+      "num_tokens": 25887575.0,
+      "step": 1260
+    },
+    {
+      "entropy": 0.8692500293254852,
+      "epoch": 2.3388581952117864,
+      "grad_norm": 0.7771905064582825,
+      "learning_rate": 9.106116790287541e-05,
+      "loss": 0.049718713760375975,
+      "mean_token_accuracy": 0.9805168390274048,
+      "num_tokens": 26092950.0,
+      "step": 1270
+    },
+    {
+      "entropy": 0.8841261565685272,
+      "epoch": 2.3572744014732967,
+      "grad_norm": 0.7791076898574829,
+      "learning_rate": 9.08866370668001e-05,
+      "loss": 0.0527400553226471,
+      "mean_token_accuracy": 0.9796754539012908,
+      "num_tokens": 26298182.0,
+      "step": 1280
+    },
+    {
+      "entropy": 0.8675022900104523,
+      "epoch": 2.3756906077348066,
+      "grad_norm": 0.8481605648994446,
+      "learning_rate": 9.07105894698464e-05,
+      "loss": 0.05320838689804077,
+      "mean_token_accuracy": 0.9792274832725525,
+      "num_tokens": 26503425.0,
+      "step": 1290
+    },
+    {
+      "entropy": 0.8704026222229004,
+      "epoch": 2.394106813996317,
+      "grad_norm": 0.8235505819320679,
+      "learning_rate": 9.053303164280602e-05,
+      "loss": 0.055045205354690555,
+      "mean_token_accuracy": 0.9788750648498535,
+      "num_tokens": 26708755.0,
+      "step": 1300
+    },
+    {
+      "entropy": 0.8525134027004242,
+      "epoch": 2.4125230202578267,
+      "grad_norm": 0.7611598968505859,
+      "learning_rate": 9.035397017249518e-05,
+      "loss": 0.05029621124267578,
+      "mean_token_accuracy": 0.9802757322788238,
+      "num_tokens": 26914704.0,
+      "step": 1310
+    },
+    {
+      "entropy": 0.8630305290222168,
+      "epoch": 2.430939226519337,
+      "grad_norm": 0.790408194065094,
+      "learning_rate": 9.017341170151041e-05,
+      "loss": 0.04856040775775909,
+      "mean_token_accuracy": 0.9809690833091735,
+      "num_tokens": 27120151.0,
+      "step": 1320
+    },
+    {
+      "entropy": 0.8579159140586853,
+      "epoch": 2.4493554327808473,
+      "grad_norm": 0.781972348690033,
+      "learning_rate": 8.999136292798207e-05,
+      "loss": 0.04869682788848877,
+      "mean_token_accuracy": 0.9816130697727203,
+      "num_tokens": 27325673.0,
+      "step": 1330
+    },
+    {
+      "entropy": 0.8634716987609863,
+      "epoch": 2.467771639042357,
+      "grad_norm": 0.8500784039497375,
+      "learning_rate": 8.980783060532588e-05,
+      "loss": 0.05050289034843445,
+      "mean_token_accuracy": 0.980079609155655,
+      "num_tokens": 27531270.0,
+      "step": 1340
+    },
+    {
+      "entropy": 0.8660618126392364,
+      "epoch": 2.4861878453038675,
+      "grad_norm": 0.719760537147522,
+      "learning_rate": 8.96228215419924e-05,
+      "loss": 0.04892141819000244,
+      "mean_token_accuracy": 0.9814020991325378,
+      "num_tokens": 27736542.0,
+      "step": 1350
+    },
+    {
+      "entropy": 0.8572284400463104,
+      "epoch": 2.5046040515653774,
+      "grad_norm": 1.0197229385375977,
+      "learning_rate": 8.943634260121442e-05,
+      "loss": 0.05104702711105347,
+      "mean_token_accuracy": 0.9798846662044525,
+      "num_tokens": 27941566.0,
+      "step": 1360
+    },
+    {
+      "entropy": 0.8702241241931915,
+      "epoch": 2.5230202578268877,
+      "grad_norm": 0.7136003375053406,
+      "learning_rate": 8.924840070075247e-05,
+      "loss": 0.04855787754058838,
+      "mean_token_accuracy": 0.9811685383319855,
+      "num_tokens": 28146943.0,
+      "step": 1370
+    },
+    {
+      "entropy": 0.874957013130188,
+      "epoch": 2.541436464088398,
+      "grad_norm": 0.8775497674942017,
+      "learning_rate": 8.905900281263804e-05,
+      "loss": 0.052434295415878296,
+      "mean_token_accuracy": 0.9795438170433044,
+      "num_tokens": 28352640.0,
+      "step": 1380
+    },
+    {
+      "entropy": 0.8776536166667939,
+      "epoch": 2.559852670349908,
+      "grad_norm": 0.8895741105079651,
+      "learning_rate": 8.8868155962915e-05,
+      "loss": 0.05282890796661377,
+      "mean_token_accuracy": 0.9790538609027862,
+      "num_tokens": 28558153.0,
+      "step": 1390
+    },
+    {
+      "entropy": 0.8738743245601654,
+      "epoch": 2.578268876611418,
+      "grad_norm": 0.788800060749054,
+      "learning_rate": 8.867586723137906e-05,
+      "loss": 0.048841872811317445,
+      "mean_token_accuracy": 0.9809149026870727,
+      "num_tokens": 28763613.0,
+      "step": 1400
+    },
+    {
+      "entropy": 0.8750253796577454,
+      "epoch": 2.596685082872928,
+      "grad_norm": 0.8738002777099609,
+      "learning_rate": 8.848214375131497e-05,
+      "loss": 0.048261132836341855,
+      "mean_token_accuracy": 0.980789190530777,
+      "num_tokens": 28969248.0,
+      "step": 1410
+    },
+    {
+      "entropy": 0.8624245524406433,
+      "epoch": 2.6151012891344383,
+      "grad_norm": 0.6404895186424255,
+      "learning_rate": 8.828699270923196e-05,
+      "loss": 0.04970468282699585,
+      "mean_token_accuracy": 0.9807762265205383,
+      "num_tokens": 29174779.0,
+      "step": 1420
+    },
+    {
+      "entropy": 0.8792938470840455,
+      "epoch": 2.6335174953959486,
+      "grad_norm": 0.7856965661048889,
+      "learning_rate": 8.80904213445972e-05,
+      "loss": 0.053334391117095946,
+      "mean_token_accuracy": 0.9790222108364105,
+      "num_tokens": 29380474.0,
+      "step": 1430
+    },
+    {
+      "entropy": 0.8831034600734711,
+      "epoch": 2.6519337016574585,
+      "grad_norm": 0.7739618420600891,
+      "learning_rate": 8.789243694956716e-05,
+      "loss": 0.04959054589271546,
+      "mean_token_accuracy": 0.9803965091705322,
+      "num_tokens": 29585985.0,
+      "step": 1440
+    },
+    {
+      "entropy": 0.8934672951698304,
+      "epoch": 2.6703499079189688,
+      "grad_norm": 0.6999697089195251,
+      "learning_rate": 8.769304686871719e-05,
+      "loss": 0.05165250301361084,
+      "mean_token_accuracy": 0.9798884153366089,
+      "num_tokens": 29791238.0,
+      "step": 1450
+    },
+    {
+      "entropy": 0.9053199410438537,
+      "epoch": 2.6887661141804786,
+      "grad_norm": 0.9199564456939697,
+      "learning_rate": 8.749225849876892e-05,
+      "loss": 0.04924143850803375,
+      "mean_token_accuracy": 0.9810785710811615,
+      "num_tokens": 29996589.0,
+      "step": 1460
+    },
+    {
+      "entropy": 0.888091403245926,
+      "epoch": 2.707182320441989,
+      "grad_norm": 0.7480106353759766,
+      "learning_rate": 8.729007928831597e-05,
+      "loss": 0.04948916733264923,
+      "mean_token_accuracy": 0.9809579730033875,
+      "num_tokens": 30201875.0,
+      "step": 1470
+    },
+    {
+      "entropy": 0.8723407983779907,
+      "epoch": 2.7255985267034992,
+      "grad_norm": 0.9506945013999939,
+      "learning_rate": 8.708651673754763e-05,
+      "loss": 0.048927539587020875,
+      "mean_token_accuracy": 0.980553150177002,
+      "num_tokens": 30407550.0,
+      "step": 1480
+    },
+    {
+      "entropy": 0.8737521529197693,
+      "epoch": 2.744014732965009,
+      "grad_norm": 0.8015706539154053,
+      "learning_rate": 8.688157839797062e-05,
+      "loss": 0.04963063597679138,
+      "mean_token_accuracy": 0.9809738755226135,
+      "num_tokens": 30612839.0,
+      "step": 1490
+    },
+    {
+      "entropy": 0.8800762951374054,
+      "epoch": 2.7624309392265194,
+      "grad_norm": 0.9429986476898193,
+      "learning_rate": 8.667527187212885e-05,
+      "loss": 0.0524174690246582,
+      "mean_token_accuracy": 0.9788767337799072,
+      "num_tokens": 30818578.0,
+      "step": 1500
+    },
+    {
+      "entropy": 0.8871055901050567,
+      "epoch": 2.7808471454880292,
+      "grad_norm": 0.5909196138381958,
+      "learning_rate": 8.646760481332157e-05,
+      "loss": 0.05166680812835693,
+      "mean_token_accuracy": 0.980216771364212,
+      "num_tokens": 31023829.0,
+      "step": 1510
+    },
+    {
+      "entropy": 0.8908755779266357,
+      "epoch": 2.7992633517495396,
+      "grad_norm": 0.9154611229896545,
+      "learning_rate": 8.625858492531931e-05,
+      "loss": 0.04951836466789246,
+      "mean_token_accuracy": 0.9801484227180481,
+      "num_tokens": 31229635.0,
+      "step": 1520
+    },
+    {
+      "entropy": 0.92480548620224,
+      "epoch": 2.81767955801105,
+      "grad_norm": 0.5989938378334045,
+      "learning_rate": 8.604821996207819e-05,
+      "loss": 0.04799881279468536,
+      "mean_token_accuracy": 0.9817522585391998,
+      "num_tokens": 31435456.0,
+      "step": 1530
+    },
+    {
+      "entropy": 0.9173881888389588,
+      "epoch": 2.8360957642725597,
+      "grad_norm": 0.899413526058197,
+      "learning_rate": 8.58365177274522e-05,
+      "loss": 0.0487445592880249,
+      "mean_token_accuracy": 0.9812625288963318,
+      "num_tokens": 31640904.0,
+      "step": 1540
+    },
+    {
+      "entropy": 0.9076135993003845,
+      "epoch": 2.85451197053407,
+      "grad_norm": 0.8494166135787964,
+      "learning_rate": 8.562348607490376e-05,
+      "loss": 0.05005228519439697,
+      "mean_token_accuracy": 0.9806681036949157,
+      "num_tokens": 31845807.0,
+      "step": 1550
+    },
+    {
+      "entropy": 0.9092245221138,
+      "epoch": 2.87292817679558,
+      "grad_norm": 0.8225123286247253,
+      "learning_rate": 8.540913290721234e-05,
+      "loss": 0.048654764890670776,
+      "mean_token_accuracy": 0.9805659353733063,
+      "num_tokens": 32051523.0,
+      "step": 1560
+    },
+    {
+      "entropy": 0.9062779664993286,
+      "epoch": 2.89134438305709,
+      "grad_norm": 0.7074014544487,
+      "learning_rate": 8.519346617618134e-05,
+      "loss": 0.049209845066070554,
+      "mean_token_accuracy": 0.9807434439659118,
+      "num_tokens": 32256895.0,
+      "step": 1570
+    },
+    {
+      "entropy": 0.9190246641635895,
+      "epoch": 2.9097605893186005,
+      "grad_norm": 0.8860642910003662,
+      "learning_rate": 8.497649388234304e-05,
+      "loss": 0.051211881637573245,
+      "mean_token_accuracy": 0.9802342295646668,
+      "num_tokens": 32462031.0,
+      "step": 1580
+    },
+    {
+      "entropy": 0.9088015079498291,
+      "epoch": 2.9281767955801103,
+      "grad_norm": 0.8062726855278015,
+      "learning_rate": 8.475822407466188e-05,
+      "loss": 0.053512704372406,
+      "mean_token_accuracy": 0.979486483335495,
+      "num_tokens": 32667533.0,
+      "step": 1590
+    },
+    {
+      "entropy": 0.9462027847766876,
+      "epoch": 2.9465930018416207,
+      "grad_norm": 0.7962909936904907,
+      "learning_rate": 8.453866485023579e-05,
+      "loss": 0.0501457154750824,
+      "mean_token_accuracy": 0.9803222417831421,
+      "num_tokens": 32872900.0,
+      "step": 1600
+    },
+    {
+      "entropy": 0.9671471297740937,
+      "epoch": 2.9650092081031305,
+      "grad_norm": 0.7641744017601013,
+      "learning_rate": 8.431782435399587e-05,
+      "loss": 0.04629061222076416,
+      "mean_token_accuracy": 0.9823175370693207,
+      "num_tokens": 33077850.0,
+      "step": 1610
+    },
+    {
+      "entropy": 0.955865204334259,
+      "epoch": 2.983425414364641,
+      "grad_norm": 0.6772348880767822,
+      "learning_rate": 8.409571077840426e-05,
+      "loss": 0.048368623852729796,
+      "mean_token_accuracy": 0.9808700799942016,
+      "num_tokens": 33283117.0,
+      "step": 1620
+    },
+    {
+      "epoch": 3.0,
+      "eval_entropy": 0.9563225186389426,
+      "eval_loss": 0.059064481407403946,
+      "eval_mean_token_accuracy": 0.9773589429648026,
+      "eval_num_tokens": 33467712.0,
+      "eval_runtime": 10.1471,
+      "eval_samples_per_second": 360.499,
+      "eval_steps_per_second": 11.333,
+      "step": 1629
+    },
+    {
+      "entropy": 0.9337226033210755,
+      "epoch": 3.001841620626151,
+      "grad_norm": 0.646203875541687,
+      "learning_rate": 8.387233236315016e-05,
+      "loss": 0.043352216482162476,
+      "mean_token_accuracy": 0.9830620110034942,
+      "num_tokens": 33488302.0,
+      "step": 1630
+    },
+    {
+      "entropy": 0.9734923839569092,
+      "epoch": 3.020257826887661,
+      "grad_norm": 0.7564226984977722,
+      "learning_rate": 8.364769739484416e-05,
+      "loss": 0.033932483196258544,
+      "mean_token_accuracy": 0.9872806966304779,
+      "num_tokens": 33693531.0,
+      "step": 1640
+    },
+    {
+      "entropy": 0.9669206500053406,
+      "epoch": 3.0386740331491713,
+      "grad_norm": 0.7126886248588562,
+      "learning_rate": 8.342181420671096e-05,
+      "loss": 0.03818287253379822,
+      "mean_token_accuracy": 0.9852082908153534,
+      "num_tokens": 33899305.0,
+      "step": 1650
+    },
+    {
+      "entropy": 0.9522916138172149,
+      "epoch": 3.0570902394106816,
+      "grad_norm": 1.0571653842926025,
+      "learning_rate": 8.319469117828007e-05,
+      "loss": 0.03456039130687714,
+      "mean_token_accuracy": 0.9867027878761292,
+      "num_tokens": 34104585.0,
+      "step": 1660
+    },
+    {
+      "entropy": 0.9568560004234314,
+      "epoch": 3.0755064456721914,
+      "grad_norm": 0.780940592288971,
+      "learning_rate": 8.296633673507505e-05,
+      "loss": 0.03551802039146423,
+      "mean_token_accuracy": 0.9867531359195709,
+      "num_tokens": 34309516.0,
+      "step": 1670
+    },
+    {
+      "entropy": 0.9590656876564025,
+      "epoch": 3.0939226519337018,
+      "grad_norm": 0.8330219388008118,
+      "learning_rate": 8.273675934830094e-05,
+      "loss": 0.03674865961074829,
+      "mean_token_accuracy": 0.9864118576049805,
+      "num_tokens": 34515170.0,
+      "step": 1680
+    },
+    {
+      "entropy": 0.975881814956665,
+      "epoch": 3.1123388581952116,
+      "grad_norm": 0.7010637521743774,
+      "learning_rate": 8.250596753453e-05,
+      "loss": 0.03550414443016052,
+      "mean_token_accuracy": 0.9864102602005005,
+      "num_tokens": 34720896.0,
+      "step": 1690
+    },
+    {
+      "entropy": 0.9599562883377075,
+      "epoch": 3.130755064456722,
+      "grad_norm": 0.6694278717041016,
+      "learning_rate": 8.227396985538578e-05,
+      "loss": 0.035564273595809937,
+      "mean_token_accuracy": 0.9867321848869324,
+      "num_tokens": 34925970.0,
+      "step": 1700
+    },
+    {
+      "entropy": 0.9582216143608093,
+      "epoch": 3.149171270718232,
+      "grad_norm": 0.9333199262619019,
+      "learning_rate": 8.204077491722546e-05,
+      "loss": 0.035575729608535764,
+      "mean_token_accuracy": 0.9862452208995819,
+      "num_tokens": 35131543.0,
+      "step": 1710
+    },
+    {
+      "entropy": 0.9579678058624268,
+      "epoch": 3.167587476979742,
+      "grad_norm": 0.9450218081474304,
+      "learning_rate": 8.180639137082066e-05,
+      "loss": 0.0385298490524292,
+      "mean_token_accuracy": 0.98538036942482,
+      "num_tokens": 35336790.0,
+      "step": 1720
+    },
+    {
+      "entropy": 0.9640831351280212,
+      "epoch": 3.1860036832412524,
+      "grad_norm": 0.8551534414291382,
+      "learning_rate": 8.157082791103649e-05,
+      "loss": 0.03702138364315033,
+      "mean_token_accuracy": 0.9852015495300293,
+      "num_tokens": 35542294.0,
+      "step": 1730
+    },
+    {
+      "entropy": 0.9867071211338043,
+      "epoch": 3.2044198895027622,
+      "grad_norm": 0.7138128876686096,
+      "learning_rate": 8.133409327650897e-05,
+      "loss": 0.035626694560050964,
+      "mean_token_accuracy": 0.986064875125885,
+      "num_tokens": 35747447.0,
+      "step": 1740
+    },
+    {
+      "entropy": 0.9639089345932007,
+      "epoch": 3.2228360957642725,
+      "grad_norm": 0.7131415009498596,
+      "learning_rate": 8.109619624932092e-05,
+      "loss": 0.035885071754455565,
+      "mean_token_accuracy": 0.986273056268692,
+      "num_tokens": 35952258.0,
+      "step": 1750
+    },
+    {
+      "entropy": 0.9516046345233917,
+      "epoch": 3.241252302025783,
+      "grad_norm": 0.6900200843811035,
+      "learning_rate": 8.085714565467611e-05,
+      "loss": 0.03535219430923462,
+      "mean_token_accuracy": 0.985836285352707,
+      "num_tokens": 36157938.0,
+      "step": 1760
+    },
+    {
+      "entropy": 0.9373646557331086,
+      "epoch": 3.2596685082872927,
+      "grad_norm": 0.6101690530776978,
+      "learning_rate": 8.061695036057191e-05,
+      "loss": 0.034940996766090394,
+      "mean_token_accuracy": 0.9863743901252746,
+      "num_tokens": 36363825.0,
+      "step": 1770
+    },
+    {
+      "entropy": 0.9444344758987426,
+      "epoch": 3.278084714548803,
+      "grad_norm": 0.7518529295921326,
+      "learning_rate": 8.03756192774703e-05,
+      "loss": 0.03404279053211212,
+      "mean_token_accuracy": 0.9866396844387054,
+      "num_tokens": 36568961.0,
+      "step": 1780
+    },
+    {
+      "entropy": 0.9550357758998871,
+      "epoch": 3.2965009208103133,
+      "grad_norm": 0.7687555551528931,
+      "learning_rate": 8.013316135796734e-05,
+      "loss": 0.038447052240371704,
+      "mean_token_accuracy": 0.985325163602829,
+      "num_tokens": 36774514.0,
+      "step": 1790
+    },
+    {
+      "entropy": 0.9477231681346894,
+      "epoch": 3.314917127071823,
+      "grad_norm": 0.7521633505821228,
+      "learning_rate": 7.988958559646102e-05,
+      "loss": 0.03746694028377533,
+      "mean_token_accuracy": 0.9853165090084076,
+      "num_tokens": 36979660.0,
+      "step": 1800
+    },
+    {
+      "entropy": 0.925805002450943,
+      "epoch": 3.3333333333333335,
+      "grad_norm": 0.9333297610282898,
+      "learning_rate": 7.964490102881768e-05,
+      "loss": 0.03700103759765625,
+      "mean_token_accuracy": 0.9850880861282348,
+      "num_tokens": 37185191.0,
+      "step": 1810
+    },
+    {
+      "entropy": 0.9225482225418091,
+      "epoch": 3.3517495395948433,
+      "grad_norm": 0.7928622961044312,
+      "learning_rate": 7.939911673203665e-05,
+      "loss": 0.03825801610946655,
+      "mean_token_accuracy": 0.9850241422653199,
+      "num_tokens": 37390749.0,
+      "step": 1820
+    },
+    {
+      "entropy": 0.9597147881984711,
+      "epoch": 3.3701657458563536,
+      "grad_norm": 0.7658583521842957,
+      "learning_rate": 7.915224182391375e-05,
+      "loss": 0.039855146408081056,
+      "mean_token_accuracy": 0.9845879554748536,
+      "num_tokens": 37596052.0,
+      "step": 1830
+    },
+    {
+      "entropy": 0.9485619068145752,
+      "epoch": 3.388581952117864,
+      "grad_norm": 0.8492130637168884,
+      "learning_rate": 7.890428546270278e-05,
+      "loss": 0.039359599351882935,
+      "mean_token_accuracy": 0.9847265422344208,
+      "num_tokens": 37802063.0,
+      "step": 1840
+    },
+    {
+      "entropy": 0.9670301914215088,
+      "epoch": 3.406998158379374,
+      "grad_norm": 0.7527599930763245,
+      "learning_rate": 7.865525684677608e-05,
+      "loss": 0.03752985596656799,
+      "mean_token_accuracy": 0.9855137526988983,
+      "num_tokens": 38007432.0,
+      "step": 1850
+    },
+    {
+      "entropy": 0.9681244969367981,
+      "epoch": 3.425414364640884,
+      "grad_norm": 0.7599612474441528,
+      "learning_rate": 7.840516521428303e-05,
+      "loss": 0.03653894364833832,
+      "mean_token_accuracy": 0.9858933389186859,
+      "num_tokens": 38212923.0,
+      "step": 1860
+    },
+    {
+      "entropy": 0.9706049561500549,
+      "epoch": 3.443830570902394,
+      "grad_norm": 0.7678127884864807,
+      "learning_rate": 7.815401984280748e-05,
+      "loss": 0.0366938978433609,
+      "mean_token_accuracy": 0.9854713797569274,
+      "num_tokens": 38418422.0,
+      "step": 1870
+    },
+    {
+      "entropy": 0.9637093842029572,
+      "epoch": 3.4622467771639043,
+      "grad_norm": 0.762824535369873,
+      "learning_rate": 7.790183004902359e-05,
+      "loss": 0.03516915142536163,
+      "mean_token_accuracy": 0.9866003453731537,
+      "num_tokens": 38624389.0,
+      "step": 1880
+    },
+    {
+      "entropy": 0.9373565018177032,
+      "epoch": 3.4806629834254146,
+      "grad_norm": 0.8221780061721802,
+      "learning_rate": 7.764860518835014e-05,
+      "loss": 0.04049026966094971,
+      "mean_token_accuracy": 0.984089481830597,
+      "num_tokens": 38829654.0,
+      "step": 1890
+    },
+    {
+      "entropy": 0.9356025457382202,
+      "epoch": 3.4990791896869244,
+      "grad_norm": 0.7583426237106323,
+      "learning_rate": 7.739435465460356e-05,
+      "loss": 0.03658481240272522,
+      "mean_token_accuracy": 0.9857318818569183,
+      "num_tokens": 39034638.0,
+      "step": 1900
+    },
+    {
+      "entropy": 0.9740163326263428,
+      "epoch": 3.5174953959484347,
+      "grad_norm": 0.7332878112792969,
+      "learning_rate": 7.713908787964937e-05,
+      "loss": 0.03508963882923126,
+      "mean_token_accuracy": 0.9863419532775879,
+      "num_tokens": 39240265.0,
+      "step": 1910
+    },
+    {
+      "entropy": 0.9528286933898926,
+      "epoch": 3.5359116022099446,
+      "grad_norm": 0.6515451669692993,
+      "learning_rate": 7.688281433305233e-05,
+      "loss": 0.036055779457092284,
+      "mean_token_accuracy": 0.9860979080200195,
+      "num_tokens": 39445546.0,
+      "step": 1920
+    },
+    {
+      "entropy": 0.9480705261230469,
+      "epoch": 3.554327808471455,
+      "grad_norm": 0.7725827097892761,
+      "learning_rate": 7.662554352172515e-05,
+      "loss": 0.037101513147354125,
+      "mean_token_accuracy": 0.985782790184021,
+      "num_tokens": 39651078.0,
+      "step": 1930
+    },
+    {
+      "entropy": 0.9655321061611175,
+      "epoch": 3.572744014732965,
+      "grad_norm": 0.7756506204605103,
+      "learning_rate": 7.636728498957581e-05,
+      "loss": 0.03721855878829956,
+      "mean_token_accuracy": 0.9857951939105988,
+      "num_tokens": 39856542.0,
+      "step": 1940
+    },
+    {
+      "entropy": 0.9772682309150695,
+      "epoch": 3.591160220994475,
+      "grad_norm": 0.9084987640380859,
+      "learning_rate": 7.610804831715355e-05,
+      "loss": 0.03570749163627625,
+      "mean_token_accuracy": 0.9863450109958649,
+      "num_tokens": 40061913.0,
+      "step": 1950
+    },
+    {
+      "entropy": 0.9579685389995575,
+      "epoch": 3.6095764272559854,
+      "grad_norm": 0.6358487606048584,
+      "learning_rate": 7.584784312129334e-05,
+      "loss": 0.038210684061050416,
+      "mean_token_accuracy": 0.9850837290287018,
+      "num_tokens": 40267398.0,
+      "step": 1960
+    },
+    {
+      "entropy": 0.9605201721191406,
+      "epoch": 3.6279926335174952,
+      "grad_norm": 0.6263149976730347,
+      "learning_rate": 7.558667905475927e-05,
+      "loss": 0.03509160876274109,
+      "mean_token_accuracy": 0.9868143379688263,
+      "num_tokens": 40472827.0,
+      "step": 1970
+    },
+    {
+      "entropy": 0.964026153087616,
+      "epoch": 3.6464088397790055,
+      "grad_norm": 0.90068119764328,
+      "learning_rate": 7.532456580588638e-05,
+      "loss": 0.036211782693862916,
+      "mean_token_accuracy": 0.9858468770980835,
+      "num_tokens": 40677935.0,
+      "step": 1980
+    },
+    {
+      "entropy": 0.9494135618209839,
+      "epoch": 3.664825046040516,
+      "grad_norm": 0.760134756565094,
+      "learning_rate": 7.50615130982213e-05,
+      "loss": 0.03786201477050781,
+      "mean_token_accuracy": 0.9852500438690186,
+      "num_tokens": 40883750.0,
+      "step": 1990
+    },
+    {
+      "entropy": 0.9527071297168732,
+      "epoch": 3.6832412523020257,
+      "grad_norm": 0.9812107682228088,
+      "learning_rate": 7.479753069016152e-05,
+      "loss": 0.03803159594535828,
+      "mean_token_accuracy": 0.9852405369281769,
+      "num_tokens": 41089115.0,
+      "step": 2000
+    },
+    {
+      "entropy": 0.9639330863952636,
+      "epoch": 3.701657458563536,
+      "grad_norm": 0.7164933681488037,
+      "learning_rate": 7.453262837459332e-05,
+      "loss": 0.03912568986415863,
+      "mean_token_accuracy": 0.9849458575248718,
+      "num_tokens": 41294694.0,
+      "step": 2010
+    },
+    {
+      "entropy": 0.9536987483501435,
+      "epoch": 3.720073664825046,
+      "grad_norm": 0.6804596185684204,
+      "learning_rate": 7.426681597852863e-05,
+      "loss": 0.036410006880760196,
+      "mean_token_accuracy": 0.985712206363678,
+      "num_tokens": 41499817.0,
+      "step": 2020
+    },
+    {
+      "entropy": 0.9478164672851562,
+      "epoch": 3.738489871086556,
+      "grad_norm": 0.8799397349357605,
+      "learning_rate": 7.400010336274037e-05,
+      "loss": 0.03801035583019256,
+      "mean_token_accuracy": 0.9850274682044983,
+      "num_tokens": 41704932.0,
+      "step": 2030
+    },
+    {
+      "entropy": 0.9383447647094727,
+      "epoch": 3.7569060773480665,
+      "grad_norm": 0.8386216163635254,
+      "learning_rate": 7.373250042139664e-05,
+      "loss": 0.0373637855052948,
+      "mean_token_accuracy": 0.9854822158813477,
+      "num_tokens": 41910804.0,
+      "step": 2040
+    },
+    {
+      "entropy": 0.925172996520996,
+      "epoch": 3.7753222836095763,
+      "grad_norm": 0.7599324584007263,
+      "learning_rate": 7.346401708169377e-05,
+      "loss": 0.03585260808467865,
+      "mean_token_accuracy": 0.9860672950744629,
+      "num_tokens": 42116706.0,
+      "step": 2050
+    },
+    {
+      "entropy": 0.9463765442371368,
+      "epoch": 3.7937384898710866,
+      "grad_norm": 0.9030149579048157,
+      "learning_rate": 7.319466330348797e-05,
+      "loss": 0.035877206921577455,
+      "mean_token_accuracy": 0.9863968968391419,
+      "num_tokens": 42322670.0,
+      "step": 2060
+    },
+    {
+      "entropy": 0.9942441761493683,
+      "epoch": 3.8121546961325965,
+      "grad_norm": 0.6400449275970459,
+      "learning_rate": 7.292444907892587e-05,
+      "loss": 0.037310433387756345,
+      "mean_token_accuracy": 0.9854151606559753,
+      "num_tokens": 42527752.0,
+      "step": 2070
+    },
+    {
+      "entropy": 0.9577703952789307,
+      "epoch": 3.830570902394107,
+      "grad_norm": 0.6193167567253113,
+      "learning_rate": 7.265338443207387e-05,
+      "loss": 0.03648848831653595,
+      "mean_token_accuracy": 0.9856530070304871,
+      "num_tokens": 42732981.0,
+      "step": 2080
+    },
+    {
+      "entropy": 0.9663952767848969,
+      "epoch": 3.848987108655617,
+      "grad_norm": 0.759611189365387,
+      "learning_rate": 7.238147941854625e-05,
+      "loss": 0.036112996935844424,
+      "mean_token_accuracy": 0.9862765550613404,
+      "num_tokens": 42938619.0,
+      "step": 2090
+    },
+    {
+      "entropy": 0.9484863519668579,
+      "epoch": 3.867403314917127,
+      "grad_norm": 0.7420705556869507,
+      "learning_rate": 7.210874412513218e-05,
+      "loss": 0.03703283965587616,
+      "mean_token_accuracy": 0.9857317566871643,
+      "num_tokens": 43143753.0,
+      "step": 2100
+    },
+    {
+      "entropy": 0.964326673746109,
+      "epoch": 3.8858195211786373,
+      "grad_norm": 0.8779639601707458,
+      "learning_rate": 7.183518866942147e-05,
+      "loss": 0.03739701807498932,
+      "mean_token_accuracy": 0.9852154791355133,
+      "num_tokens": 43349451.0,
+      "step": 2110
+    },
+    {
+      "entropy": 0.9729791641235351,
+      "epoch": 3.904235727440147,
+      "grad_norm": 0.7582741379737854,
+      "learning_rate": 7.156082319942929e-05,
+      "loss": 0.03894525766372681,
+      "mean_token_accuracy": 0.9847454309463501,
+      "num_tokens": 43554598.0,
+      "step": 2120
+    },
+    {
+      "entropy": 0.9860592544078827,
+      "epoch": 3.9226519337016574,
+      "grad_norm": 0.860698938369751,
+      "learning_rate": 7.128565789321969e-05,
+      "loss": 0.0365300178527832,
+      "mean_token_accuracy": 0.9859121859073638,
+      "num_tokens": 43760081.0,
+      "step": 2130
+    },
+    {
+      "entropy": 0.9916551172733307,
+      "epoch": 3.9410681399631677,
+      "grad_norm": 0.8363776206970215,
+      "learning_rate": 7.100970295852805e-05,
+      "loss": 0.036221379041671754,
+      "mean_token_accuracy": 0.9859034180641174,
+      "num_tokens": 43965432.0,
+      "step": 2140
+    },
+    {
+      "entropy": 0.9553558886051178,
+      "epoch": 3.9594843462246776,
+      "grad_norm": 0.9627474546432495,
+      "learning_rate": 7.073296863238242e-05,
+      "loss": 0.03684481382369995,
+      "mean_token_accuracy": 0.9857315957546234,
+      "num_tokens": 44171232.0,
+      "step": 2150
+    },
+    {
+      "entropy": 0.9538035809993743,
+      "epoch": 3.977900552486188,
+      "grad_norm": 0.8399474620819092,
+      "learning_rate": 7.045546518072366e-05,
+      "loss": 0.03825397789478302,
+      "mean_token_accuracy": 0.9846831560134888,
+      "num_tokens": 44376723.0,
+      "step": 2160
+    },
+    {
+      "entropy": 0.9476235210895538,
+      "epoch": 3.9963167587476978,
+      "grad_norm": 0.708739697933197,
+      "learning_rate": 7.017720289802472e-05,
+      "loss": 0.03618018329143524,
+      "mean_token_accuracy": 0.9861325800418854,
+      "num_tokens": 44582407.0,
+      "step": 2170
+    },
+    {
+      "epoch": 4.0,
+      "eval_entropy": 0.9569619194321011,
+      "eval_loss": 0.059838198125362396,
+      "eval_mean_token_accuracy": 0.9777795366618944,
+      "eval_num_tokens": 44623647.0,
+      "eval_runtime": 10.0379,
+      "eval_samples_per_second": 364.42,
+      "eval_steps_per_second": 11.457,
+      "step": 2172
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 5430,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 10,
+  "save_steps": 500,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 2.12729708313523e+18,
+  "train_batch_size": 32,
+  "trial_name": null,
+  "trial_params": null
+}
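
The `learning_rate` values logged in these `trainer_state.json` files appear consistent with a linear warmup followed by cosine decay over `max_steps` (5430). The sketch below is a reconstruction, not the training configuration itself: the peak rate of 1e-4, the warmup length of 272 steps, and the one-step lag between the logged `step` and the scheduler step are all assumptions inferred from the logged numbers, and `logged_lr` is a hypothetical helper name.

```python
import math

PEAK_LR = 1e-4        # assumption: inferred from the largest logged values
WARMUP_STEPS = 272    # assumption: inferred from the warmup entries (e.g. 9/272 * 1e-4 at step 10)
MAX_STEPS = 5430      # "max_steps" in trainer_state.json


def logged_lr(step: int) -> float:
    """Approximate the learning rate logged at a given global step."""
    s = step - 1  # assumption: the logged value lags the step counter by one
    if s < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR.
        return PEAK_LR * s / WARMUP_STEPS
    # Cosine decay from PEAK_LR towards 0 over the remaining steps.
    progress = (s - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (10, 280, 1840, 2170):
        print(step, logged_lr(step))
```

Comparing the printed values with the `learning_rate` entries at the same steps is a quick way to check whether the assumed schedule still holds for later checkpoints.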

checkpoint-2172/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
+size 5777

checkpoint-2715/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-2715/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
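
As a rough reading of this adapter config, the sketch below computes the LoRA scaling factor and an estimated trainable-parameter count. The `r`, `lora_alpha`, `use_rslora`, and `target_modules` values come from the file above. The projection shapes and layer count for Qwen2.5-7B-Instruct are assumptions that do not appear in this diff, and the helper names are hypothetical.

```python
import math

# Values taken from adapter_config.json above.
R = 8
LORA_ALPHA = 32
USE_RSLORA = False
TARGET_MODULES = ["o_proj", "v_proj", "down_proj", "q_proj",
                  "gate_proj", "k_proj", "up_proj"]

# Assumed base-model dimensions (not stated in this diff).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3584, 18944, 512, 28
SHAPES = {  # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def lora_param_count() -> int:
    """Each adapted linear layer adds A (r x in) and B (out x r) matrices."""
    per_layer = sum(R * (SHAPES[name][0] + SHAPES[name][1]) for name in TARGET_MODULES)
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print("scaling:", lora_scaling(LORA_ALPHA, R, USE_RSLORA))
    print("trainable LoRA parameters:", lora_param_count())
```

Under these assumptions the count is about 20.2M parameters, which at 4 bytes each is roughly in line with the 80,792,096-byte `adapter_model.safetensors` pointer listed for this checkpoint.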

checkpoint-2715/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:65a66fa5ef9ed41e342eac55fca7f83744379f75f4d29b573d2790ba504c1659
+size 80792096

checkpoint-2715/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
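
To make the template's output concrete, here is a minimal Python sketch of its non-tool path only: it mirrors the branch taken when `tools` is unset and no message carries `tool_calls` or the `tool` role. `render_chat` is a hypothetical helper written for illustration. It is not part of the checkpoint and does not replace rendering the Jinja template itself.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chat(messages, add_generation_prompt=False):
    """Render messages in the ChatML-style layout used by the template (no tools)."""
    if not messages:
        raise ValueError("messages must not be empty")
    parts = []
    # The template always opens with a system turn: the caller's if given, else a default.
    if messages[0]["role"] == "system":
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
    for index, message in enumerate(messages):
        role = message["role"]
        if role == "system" and index == 0:
            continue  # already emitted above
        if role not in ("user", "system", "assistant") or message.get("tool_calls"):
            raise NotImplementedError("tool calls and tool responses are not covered by this sketch")
        parts.append("<|im_start|>" + role + "\n" + message["content"] + "<|im_end|>\n")
    if add_generation_prompt:
        # Leaves the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chat([{"role": "user", "content": "Who are you?"}], add_generation_prompt=True))
```

Because the default system prompt is injected whenever the first message is not a system message, passing an explicit system message is the way to override it.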

checkpoint-2715/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

checkpoint-2715/tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

checkpoint-2715/trainer_state.json ADDED

	@@ -0,0 +1,2799 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 5.0,
+  "eval_steps": 500,
+  "global_step": 2715,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "entropy": 1.2237394809722901,
+      "epoch": 0.01841620626151013,
+      "grad_norm": 5.082435607910156,
+      "learning_rate": 3.308823529411765e-06,
+      "loss": 0.9237876892089844,
+      "mean_token_accuracy": 0.7685343027114868,
+      "num_tokens": 205423.0,
+      "step": 10
+    },
+    {
+      "entropy": 1.2295925617218018,
+      "epoch": 0.03683241252302026,
+      "grad_norm": 4.672000408172607,
+      "learning_rate": 6.985294117647059e-06,
+      "loss": 0.8900892257690429,
+      "mean_token_accuracy": 0.7677771031856537,
+      "num_tokens": 410849.0,
+      "step": 20
+    },
+    {
+      "entropy": 1.2285718679428101,
+      "epoch": 0.055248618784530384,
+      "grad_norm": 1.4828118085861206,
+      "learning_rate": 1.0661764705882354e-05,
+      "loss": 0.5975452899932862,
+      "mean_token_accuracy": 0.8146551787853241,
+      "num_tokens": 616438.0,
+      "step": 30
+    },
+    {
+      "entropy": 1.210776400566101,
+      "epoch": 0.07366482504604052,
+      "grad_norm": 0.7761328816413879,
+      "learning_rate": 1.4338235294117647e-05,
+      "loss": 0.40664992332458494,
+      "mean_token_accuracy": 0.8699092030525207,
+      "num_tokens": 822118.0,
+      "step": 40
+    },
+    {
+      "entropy": 1.200321125984192,
+      "epoch": 0.09208103130755065,
+      "grad_norm": 0.5363371968269348,
+      "learning_rate": 1.8014705882352943e-05,
+      "loss": 0.3313469409942627,
+      "mean_token_accuracy": 0.8904915869235992,
+      "num_tokens": 1027941.0,
+      "step": 50
+    },
+    {
+      "entropy": 1.1809936046600342,
+      "epoch": 0.11049723756906077,
+      "grad_norm": 0.39541518688201904,
+      "learning_rate": 2.1691176470588237e-05,
+      "loss": 0.27568228244781495,
+      "mean_token_accuracy": 0.9047131836414337,
+      "num_tokens": 1233620.0,
+      "step": 60
+    },
+    {
+      "entropy": 1.169810914993286,
+      "epoch": 0.1289134438305709,
+      "grad_norm": 0.341960072517395,
+      "learning_rate": 2.536764705882353e-05,
+      "loss": 0.245219087600708,
+      "mean_token_accuracy": 0.9150686681270599,
+      "num_tokens": 1438656.0,
+      "step": 70
+    },
+    {
+      "entropy": 1.1652960777282715,
+      "epoch": 0.14732965009208104,
+      "grad_norm": 0.36872178316116333,
+      "learning_rate": 2.9044117647058828e-05,
+      "loss": 0.2220149040222168,
+      "mean_token_accuracy": 0.9224777698516846,
+      "num_tokens": 1643877.0,
+      "step": 80
+    },
+    {
+      "entropy": 1.154341197013855,
+      "epoch": 0.16574585635359115,
+      "grad_norm": 0.4152425229549408,
+      "learning_rate": 3.272058823529412e-05,
+      "loss": 0.2002798557281494,
+      "mean_token_accuracy": 0.9285802960395813,
+      "num_tokens": 1849506.0,
+      "step": 90
+    },
+    {
+      "entropy": 1.1507258892059327,
+      "epoch": 0.1841620626151013,
+      "grad_norm": 0.47647765278816223,
+      "learning_rate": 3.639705882352941e-05,
+      "loss": 0.18871363401412963,
+      "mean_token_accuracy": 0.9318056285381318,
+      "num_tokens": 2055071.0,
+      "step": 100
+    },
+    {
+      "entropy": 1.1455535531044005,
+      "epoch": 0.20257826887661143,
+      "grad_norm": 0.4853009581565857,
+      "learning_rate": 4.007352941176471e-05,
+      "loss": 0.17836341857910157,
+      "mean_token_accuracy": 0.9367631554603577,
+      "num_tokens": 2260643.0,
+      "step": 110
+    },
+    {
+      "entropy": 1.1402526497840881,
+      "epoch": 0.22099447513812154,
+      "grad_norm": 0.4455392360687256,
+      "learning_rate": 4.375e-05,
+      "loss": 0.16921783685684205,
+      "mean_token_accuracy": 0.9386959195137023,
+      "num_tokens": 2466085.0,
+      "step": 120
+    },
+    {
+      "entropy": 1.1374777555465698,
+      "epoch": 0.23941068139963168,
+      "grad_norm": 0.5880279541015625,
+      "learning_rate": 4.742647058823529e-05,
+      "loss": 0.15989291667938232,
+      "mean_token_accuracy": 0.9421182632446289,
+      "num_tokens": 2671024.0,
+      "step": 130
+    },
+    {
+      "entropy": 1.1273940205574036,
+      "epoch": 0.2578268876611418,
+      "grad_norm": 0.612959086894989,
+      "learning_rate": 5.110294117647059e-05,
+      "loss": 0.14701461791992188,
+      "mean_token_accuracy": 0.9463540315628052,
+      "num_tokens": 2876848.0,
+      "step": 140
+    },
+    {
+      "entropy": 1.1263513088226318,
+      "epoch": 0.27624309392265195,
+      "grad_norm": 0.5695255398750305,
+      "learning_rate": 5.477941176470589e-05,
+      "loss": 0.14604382514953612,
+      "mean_token_accuracy": 0.946351945400238,
+      "num_tokens": 3082589.0,
+      "step": 150
+    },
+    {
+      "entropy": 1.1290789365768432,
+      "epoch": 0.2946593001841621,
+      "grad_norm": 0.6608090996742249,
+      "learning_rate": 5.845588235294118e-05,
+      "loss": 0.1409450054168701,
+      "mean_token_accuracy": 0.9481450319290161,
+      "num_tokens": 3287459.0,
+      "step": 160
+    },
+    {
+      "entropy": 1.1291529774665832,
+      "epoch": 0.31307550644567217,
+      "grad_norm": 0.652715802192688,
+      "learning_rate": 6.213235294117647e-05,
+      "loss": 0.14441155195236205,
+      "mean_token_accuracy": 0.9466125547885895,
+      "num_tokens": 3493682.0,
+      "step": 170
+    },
+    {
+      "entropy": 1.1244838953018188,
+      "epoch": 0.3314917127071823,
+      "grad_norm": 0.7815241813659668,
+      "learning_rate": 6.580882352941177e-05,
+      "loss": 0.13361064195632935,
+      "mean_token_accuracy": 0.9512295544147491,
+      "num_tokens": 3699573.0,
+      "step": 180
+    },
+    {
+      "entropy": 1.1217721104621887,
+      "epoch": 0.34990791896869244,
+      "grad_norm": 0.7933160066604614,
+      "learning_rate": 6.948529411764706e-05,
+      "loss": 0.13089522123336791,
+      "mean_token_accuracy": 0.9520221531391144,
+      "num_tokens": 3905156.0,
+      "step": 190
+    },
+    {
+      "entropy": 1.1206679105758668,
+      "epoch": 0.3683241252302026,
+      "grad_norm": 0.6815240383148193,
+      "learning_rate": 7.316176470588236e-05,
+      "loss": 0.13400404453277587,
+      "mean_token_accuracy": 0.9501322209835052,
+      "num_tokens": 4110570.0,
+      "step": 200
+    },
+    {
+      "entropy": 1.1161052227020263,
+      "epoch": 0.3867403314917127,
+      "grad_norm": 0.8297767639160156,
+      "learning_rate": 7.683823529411766e-05,
+      "loss": 0.13389937877655028,
+      "mean_token_accuracy": 0.9501932203769684,
+      "num_tokens": 4315834.0,
+      "step": 210
+    },
+    {
+      "entropy": 1.1098745942115784,
+      "epoch": 0.40515653775322286,
+      "grad_norm": 0.5943381786346436,
+      "learning_rate": 8.051470588235294e-05,
+      "loss": 0.13452907800674438,
+      "mean_token_accuracy": 0.9503286242485046,
+      "num_tokens": 4520807.0,
+      "step": 220
+    },
+    {
+      "entropy": 1.100480353832245,
+      "epoch": 0.42357274401473294,
+      "grad_norm": 0.6094359755516052,
+      "learning_rate": 8.419117647058824e-05,
+      "loss": 0.12827746868133544,
+      "mean_token_accuracy": 0.952492094039917,
+      "num_tokens": 4725867.0,
+      "step": 230
+    },
+    {
+      "entropy": 1.0901286959648133,
+      "epoch": 0.4419889502762431,
+      "grad_norm": 0.7240597605705261,
+      "learning_rate": 8.786764705882353e-05,
+      "loss": 0.12171242237091065,
+      "mean_token_accuracy": 0.953943532705307,
+      "num_tokens": 4931629.0,
+      "step": 240
+    },
+    {
+      "entropy": 1.0885071873664856,
+      "epoch": 0.4604051565377532,
+      "grad_norm": 0.6939547657966614,
+      "learning_rate": 9.154411764705882e-05,
+      "loss": 0.12155698537826538,
+      "mean_token_accuracy": 0.9545870959758759,
+      "num_tokens": 5137285.0,
+      "step": 250
+    },
+    {
+      "entropy": 1.086272156238556,
+      "epoch": 0.47882136279926335,
+      "grad_norm": 0.5752800703048706,
+      "learning_rate": 9.522058823529412e-05,
+      "loss": 0.12157790660858155,
+      "mean_token_accuracy": 0.9541126549243927,
+      "num_tokens": 5342575.0,
+      "step": 260
+    },
+    {
+      "entropy": 1.0857678413391114,
+      "epoch": 0.4972375690607735,
+      "grad_norm": 0.7565123438835144,
+      "learning_rate": 9.889705882352942e-05,
+      "loss": 0.12349612712860107,
+      "mean_token_accuracy": 0.9535140514373779,
+      "num_tokens": 5547995.0,
+      "step": 270
+    },
+    {
+      "entropy": 1.079762625694275,
+      "epoch": 0.5156537753222836,
+      "grad_norm": 0.6972768306732178,
+      "learning_rate": 9.999954556423843e-05,
+      "loss": 0.11875582933425903,
+      "mean_token_accuracy": 0.9556483089923858,
+      "num_tokens": 5753195.0,
+      "step": 280
+    },
+    {
+      "entropy": 1.0742079138755798,
+      "epoch": 0.5340699815837937,
+      "grad_norm": 0.7821696996688843,
+      "learning_rate": 9.999731977631227e-05,
+      "loss": 0.11824090480804443,
+      "mean_token_accuracy": 0.9557521045207977,
+      "num_tokens": 5958236.0,
+      "step": 290
+    },
+    {
+      "entropy": 1.0679773569107056,
+      "epoch": 0.5524861878453039,
+      "grad_norm": 0.5846888422966003,
+      "learning_rate": 9.999323925089486e-05,
+      "loss": 0.11707355976104736,
+      "mean_token_accuracy": 0.9554719448089599,
+      "num_tokens": 6163992.0,
+      "step": 300
+    },
+    {
+      "entropy": 1.0655727863311768,
+      "epoch": 0.570902394106814,
+      "grad_norm": 0.5812502503395081,
+      "learning_rate": 9.998730413936037e-05,
+      "loss": 0.11371417045593261,
+      "mean_token_accuracy": 0.9576376020908356,
+      "num_tokens": 6369456.0,
+      "step": 310
+    },
+    {
+      "entropy": 1.0607039332389832,
+      "epoch": 0.5893186003683242,
+      "grad_norm": 0.6238475441932678,
+      "learning_rate": 9.99795146618821e-05,
+      "loss": 0.11775733232498169,
+      "mean_token_accuracy": 0.9557221591472626,
+      "num_tokens": 6574833.0,
+      "step": 320
+    },
+    {
+      "entropy": 1.0504255175590516,
+      "epoch": 0.6077348066298343,
+      "grad_norm": 0.6496815085411072,
+      "learning_rate": 9.996987110742422e-05,
+      "loss": 0.10904088020324706,
+      "mean_token_accuracy": 0.9585366368293762,
+      "num_tokens": 6780108.0,
+      "step": 330
+    },
+    {
+      "entropy": 1.0456081986427308,
+      "epoch": 0.6261510128913443,
+      "grad_norm": 0.786702573299408,
+      "learning_rate": 9.995837383373119e-05,
+      "loss": 0.10642309188842773,
+      "mean_token_accuracy": 0.9596696078777314,
+      "num_tokens": 6985920.0,
+      "step": 340
+    },
+    {
+      "entropy": 1.0455098271369934,
+      "epoch": 0.6445672191528545,
+      "grad_norm": 0.5473790168762207,
+      "learning_rate": 9.994502326731434e-05,
+      "loss": 0.10822961330413819,
+      "mean_token_accuracy": 0.959563136100769,
+      "num_tokens": 7191465.0,
+      "step": 350
+    },
+    {
+      "entropy": 1.04240562915802,
+      "epoch": 0.6629834254143646,
+      "grad_norm": 0.6672356128692627,
+      "learning_rate": 9.992981990343614e-05,
+      "loss": 0.1110004186630249,
+      "mean_token_accuracy": 0.9582514643669129,
+      "num_tokens": 7396877.0,
+      "step": 360
+    },
+    {
+      "entropy": 1.0386811256408692,
+      "epoch": 0.6813996316758748,
+      "grad_norm": 0.698539674282074,
+      "learning_rate": 9.99127643060918e-05,
+      "loss": 0.107539963722229,
+      "mean_token_accuracy": 0.9593036234378814,
+      "num_tokens": 7602437.0,
+      "step": 370
+    },
+    {
+      "entropy": 1.0311225533485413,
+      "epoch": 0.6998158379373849,
+      "grad_norm": 0.6629284024238586,
+      "learning_rate": 9.989385710798837e-05,
+      "loss": 0.1064023494720459,
+      "mean_token_accuracy": 0.9602205216884613,
+      "num_tokens": 7808142.0,
+      "step": 380
+    },
+    {
+      "entropy": 1.030210506916046,
+      "epoch": 0.7182320441988951,
+      "grad_norm": 0.5616748929023743,
+      "learning_rate": 9.987309901052121e-05,
+      "loss": 0.10717041492462158,
+      "mean_token_accuracy": 0.9599347949028015,
+      "num_tokens": 8013407.0,
+      "step": 390
+    },
+    {
+      "entropy": 1.0208017826080322,
+      "epoch": 0.7366482504604052,
+      "grad_norm": 0.6329049468040466,
+      "learning_rate": 9.985049078374806e-05,
+      "loss": 0.10359601974487305,
+      "mean_token_accuracy": 0.9603756129741668,
+      "num_tokens": 8219040.0,
+      "step": 400
+    },
+    {
+      "entropy": 1.015640377998352,
+      "epoch": 0.7550644567219152,
+      "grad_norm": 0.6516013741493225,
+      "learning_rate": 9.982603326636037e-05,
+      "loss": 0.10146439075469971,
+      "mean_token_accuracy": 0.9627702474594116,
+      "num_tokens": 8424678.0,
+      "step": 410
+    },
+    {
+      "entropy": 1.0105359435081482,
+      "epoch": 0.7734806629834254,
+      "grad_norm": 0.6920603513717651,
+      "learning_rate": 9.979972736565226e-05,
+      "loss": 0.10770498514175415,
+      "mean_token_accuracy": 0.9591470420360565,
+      "num_tokens": 8629868.0,
+      "step": 420
+    },
+    {
+      "entropy": 0.9966452836990356,
+      "epoch": 0.7918968692449355,
+      "grad_norm": 0.6857476234436035,
+      "learning_rate": 9.977157405748687e-05,
+      "loss": 0.10282524824142455,
+      "mean_token_accuracy": 0.9612209022045135,
+      "num_tokens": 8835320.0,
+      "step": 430
+    },
+    {
+      "entropy": 0.9945534646511078,
+      "epoch": 0.8103130755064457,
+      "grad_norm": 0.7208472490310669,
+      "learning_rate": 9.974157438626008e-05,
+      "loss": 0.10069938898086547,
+      "mean_token_accuracy": 0.9620070576667785,
+      "num_tokens": 9041123.0,
+      "step": 440
+    },
+    {
+      "entropy": 0.979461395740509,
+      "epoch": 0.8287292817679558,
+      "grad_norm": 0.5071915984153748,
+      "learning_rate": 9.970972946486185e-05,
+      "loss": 0.09799174070358277,
+      "mean_token_accuracy": 0.9620374023914338,
+      "num_tokens": 9246361.0,
+      "step": 450
+    },
+    {
+      "entropy": 0.9830998003482818,
+      "epoch": 0.8471454880294659,
+      "grad_norm": 0.8660802245140076,
+      "learning_rate": 9.967604047463493e-05,
+      "loss": 0.10378165245056152,
+      "mean_token_accuracy": 0.9606865763664245,
+      "num_tokens": 9451845.0,
+      "step": 460
+    },
+    {
+      "entropy": 0.9813413023948669,
+      "epoch": 0.8655616942909761,
+      "grad_norm": 0.7642477750778198,
+      "learning_rate": 9.964050866533094e-05,
+      "loss": 0.1010061264038086,
+      "mean_token_accuracy": 0.9608745336532593,
+      "num_tokens": 9656802.0,
+      "step": 470
+    },
+    {
+      "entropy": 0.967874163389206,
+      "epoch": 0.8839779005524862,
+      "grad_norm": 0.5987281799316406,
+      "learning_rate": 9.960313535506411e-05,
+      "loss": 0.10169394016265869,
+      "mean_token_accuracy": 0.9611998200416565,
+      "num_tokens": 9861719.0,
+      "step": 480
+    },
+    {
+      "entropy": 0.9663491308689117,
+      "epoch": 0.9023941068139963,
+      "grad_norm": 0.6124638319015503,
+      "learning_rate": 9.956392193026239e-05,
+      "loss": 0.102389657497406,
+      "mean_token_accuracy": 0.9611884355545044,
+      "num_tokens": 10066673.0,
+      "step": 490
+    },
+    {
+      "entropy": 0.959654438495636,
+      "epoch": 0.9208103130755064,
+      "grad_norm": 0.7873051762580872,
+      "learning_rate": 9.952286984561592e-05,
+      "loss": 0.10170392990112305,
+      "mean_token_accuracy": 0.9610928475856781,
+      "num_tokens": 10272091.0,
+      "step": 500
+    },
+    {
+      "entropy": 0.9550537407398224,
+      "epoch": 0.9392265193370166,
+      "grad_norm": 0.6071968078613281,
+      "learning_rate": 9.947998062402313e-05,
+      "loss": 0.09448277950286865,
+      "mean_token_accuracy": 0.9648977637290954,
+      "num_tokens": 10477632.0,
+      "step": 510
+    },
+    {
+      "entropy": 0.9538533687591553,
+      "epoch": 0.9576427255985267,
+      "grad_norm": 0.6317242980003357,
+      "learning_rate": 9.943525585653428e-05,
+      "loss": 0.09542192220687866,
+      "mean_token_accuracy": 0.9635261118412017,
+      "num_tokens": 10682828.0,
+      "step": 520
+    },
+    {
+      "entropy": 0.9362513542175293,
+      "epoch": 0.9760589318600368,
+      "grad_norm": 0.6421944499015808,
+      "learning_rate": 9.938869720229234e-05,
+      "loss": 0.09382058382034301,
+      "mean_token_accuracy": 0.9648073971271515,
+      "num_tokens": 10888741.0,
+      "step": 530
+    },
+    {
+      "entropy": 0.9235438346862793,
+      "epoch": 0.994475138121547,
+      "grad_norm": 0.7986873388290405,
+      "learning_rate": 9.934030638847155e-05,
+      "loss": 0.09827429056167603,
+      "mean_token_accuracy": 0.9621128737926483,
+      "num_tokens": 11094387.0,
+      "step": 540
+    },
+    {
+      "epoch": 1.0,
+      "eval_entropy": 0.9137652366057686,
+      "eval_loss": 0.09368764609098434,
+      "eval_mean_token_accuracy": 0.9640816880309063,
+      "eval_num_tokens": 11155908.0,
+      "eval_runtime": 10.4701,
+      "eval_samples_per_second": 349.377,
+      "eval_steps_per_second": 10.984,
+      "step": 543
+    },
+    {
+      "entropy": 0.9047818422317505,
+      "epoch": 1.0128913443830572,
+      "grad_norm": 0.6781501173973083,
+      "learning_rate": 9.929008521021325e-05,
+      "loss": 0.0863916516304016,
+      "mean_token_accuracy": 0.9673655688762665,
+      "num_tokens": 11299715.0,
+      "step": 550
+    },
+    {
+      "entropy": 0.8856981039047241,
+      "epoch": 1.0313075506445673,
+      "grad_norm": 0.7143136858940125,
+      "learning_rate": 9.923803553055937e-05,
+      "loss": 0.08632323145866394,
+      "mean_token_accuracy": 0.9677783191204071,
+      "num_tokens": 11505059.0,
+      "step": 560
+    },
+    {
+      "entropy": 0.8937099635601043,
+      "epoch": 1.0497237569060773,
+      "grad_norm": 0.7751694321632385,
+      "learning_rate": 9.918415928038325e-05,
+      "loss": 0.08178263902664185,
+      "mean_token_accuracy": 0.9694291114807129,
+      "num_tokens": 11710464.0,
+      "step": 570
+    },
+    {
+      "entropy": 0.8858704209327698,
+      "epoch": 1.0681399631675874,
+      "grad_norm": 0.7492292523384094,
+      "learning_rate": 9.912845845831805e-05,
+      "loss": 0.08074211478233337,
+      "mean_token_accuracy": 0.9692470014095307,
+      "num_tokens": 11915959.0,
+      "step": 580
+    },
+    {
+      "entropy": 0.8948039829730987,
+      "epoch": 1.0865561694290977,
+      "grad_norm": 0.8116479516029358,
+      "learning_rate": 9.907093513068259e-05,
+      "loss": 0.08712012171745301,
+      "mean_token_accuracy": 0.9669980227947235,
+      "num_tokens": 12121499.0,
+      "step": 590
+    },
+    {
+      "entropy": 0.8846789538860321,
+      "epoch": 1.1049723756906078,
+      "grad_norm": 0.7295626997947693,
+      "learning_rate": 9.901159143140471e-05,
+      "loss": 0.08444435596466064,
+      "mean_token_accuracy": 0.9674544095993042,
+      "num_tokens": 12327061.0,
+      "step": 600
+    },
+    {
+      "entropy": 0.8734103918075562,
+      "epoch": 1.1233885819521179,
+      "grad_norm": 0.9585768580436707,
+      "learning_rate": 9.89504295619421e-05,
+      "loss": 0.08022565841674804,
+      "mean_token_accuracy": 0.969569206237793,
+      "num_tokens": 12532305.0,
+      "step": 610
+    },
+    {
+      "entropy": 0.8640486001968384,
+      "epoch": 1.141804788213628,
+      "grad_norm": 0.7891159057617188,
+      "learning_rate": 9.88874517912006e-05,
+      "loss": 0.08415375947952271,
+      "mean_token_accuracy": 0.9678892493247986,
+      "num_tokens": 12737828.0,
+      "step": 620
+    },
+    {
+      "entropy": 0.8599755525588989,
+      "epoch": 1.160220994475138,
+      "grad_norm": 0.5801345109939575,
+      "learning_rate": 9.882266045545012e-05,
+      "loss": 0.08100489974021911,
+      "mean_token_accuracy": 0.9688023269176483,
+      "num_tokens": 12943343.0,
+      "step": 630
+    },
+    {
+      "entropy": 0.86524977684021,
+      "epoch": 1.1786372007366483,
+      "grad_norm": 0.7633041143417358,
+      "learning_rate": 9.87560579582379e-05,
+      "loss": 0.07859406471252442,
+      "mean_token_accuracy": 0.9702189445495606,
+      "num_tokens": 13148473.0,
+      "step": 640
+    },
+    {
+      "entropy": 0.8466695249080658,
+      "epoch": 1.1970534069981584,
+      "grad_norm": 0.8672215938568115,
+      "learning_rate": 9.868764677029934e-05,
+      "loss": 0.08082623481750488,
+      "mean_token_accuracy": 0.9689972400665283,
+      "num_tokens": 13353890.0,
+      "step": 650
+    },
+    {
+      "entropy": 0.8596941530704498,
+      "epoch": 1.2154696132596685,
+      "grad_norm": 0.7524124383926392,
+      "learning_rate": 9.861742942946639e-05,
+      "loss": 0.0789935290813446,
+      "mean_token_accuracy": 0.9693858206272126,
+      "num_tokens": 13559475.0,
+      "step": 660
+    },
+    {
+      "entropy": 0.8708749234676361,
+      "epoch": 1.2338858195211786,
+      "grad_norm": 0.5777031183242798,
+      "learning_rate": 9.854540854057337e-05,
+      "loss": 0.07773642539978028,
+      "mean_token_accuracy": 0.970385092496872,
+      "num_tokens": 13765076.0,
+      "step": 670
+    },
+    {
+      "entropy": 0.8651713371276856,
+      "epoch": 1.2523020257826887,
+      "grad_norm": 0.7924166321754456,
+      "learning_rate": 9.847158677536034e-05,
+      "loss": 0.0766686737537384,
+      "mean_token_accuracy": 0.9702267110347748,
+      "num_tokens": 13970642.0,
+      "step": 680
+    },
+    {
+      "entropy": 0.8763024985790253,
+      "epoch": 1.270718232044199,
+      "grad_norm": 0.741219162940979,
+      "learning_rate": 9.839596687237403e-05,
+      "loss": 0.07189929485321045,
+      "mean_token_accuracy": 0.9727097094058991,
+      "num_tokens": 14176556.0,
+      "step": 690
+    },
+    {
+      "entropy": 0.8556921362876893,
+      "epoch": 1.289134438305709,
+      "grad_norm": 0.6298198103904724,
+      "learning_rate": 9.831855163686618e-05,
+      "loss": 0.07608137726783752,
+      "mean_token_accuracy": 0.9716399371623993,
+      "num_tokens": 14381686.0,
+      "step": 700
+    },
+    {
+      "entropy": 0.869178420305252,
+      "epoch": 1.3075506445672191,
+      "grad_norm": 0.5850273370742798,
+      "learning_rate": 9.823934394068952e-05,
+      "loss": 0.07437651753425598,
+      "mean_token_accuracy": 0.9709566533565521,
+      "num_tokens": 14586814.0,
+      "step": 710
+    },
+    {
+      "entropy": 0.8708595156669616,
+      "epoch": 1.3259668508287292,
+      "grad_norm": 0.6580632328987122,
+      "learning_rate": 9.815834672219127e-05,
+      "loss": 0.07518917322158813,
+      "mean_token_accuracy": 0.9717426657676697,
+      "num_tokens": 14792321.0,
+      "step": 720
+    },
+    {
+      "entropy": 0.8826817810535431,
+      "epoch": 1.3443830570902393,
+      "grad_norm": 0.8788532018661499,
+      "learning_rate": 9.807556298610404e-05,
+      "loss": 0.07579240798950196,
+      "mean_token_accuracy": 0.9706341981887817,
+      "num_tokens": 14997810.0,
+      "step": 730
+    },
+    {
+      "entropy": 0.9012470185756684,
+      "epoch": 1.3627992633517496,
+      "grad_norm": 0.7022138237953186,
+      "learning_rate": 9.799099580343441e-05,
+      "loss": 0.0775588572025299,
+      "mean_token_accuracy": 0.9699241399765015,
+      "num_tokens": 15203795.0,
+      "step": 740
+    },
+    {
+      "entropy": 0.886955714225769,
+      "epoch": 1.3812154696132597,
+      "grad_norm": 0.7881133556365967,
+      "learning_rate": 9.790464831134903e-05,
+      "loss": 0.07125020027160645,
+      "mean_token_accuracy": 0.9723815560340882,
+      "num_tokens": 15408974.0,
+      "step": 750
+    },
+    {
+      "entropy": 0.9047374844551086,
+      "epoch": 1.3996316758747698,
+      "grad_norm": 0.9082005023956299,
+      "learning_rate": 9.781652371305824e-05,
+      "loss": 0.07004334926605224,
+      "mean_token_accuracy": 0.9725580036640167,
+      "num_tokens": 15614399.0,
+      "step": 760
+    },
+    {
+      "entropy": 0.9039053857326508,
+      "epoch": 1.4180478821362799,
+      "grad_norm": 0.8060817122459412,
+      "learning_rate": 9.77266252776972e-05,
+      "loss": 0.07103485465049744,
+      "mean_token_accuracy": 0.9721468150615692,
+      "num_tokens": 15819895.0,
+      "step": 770
+    },
+    {
+      "entropy": 0.8998047232627868,
+      "epoch": 1.43646408839779,
+      "grad_norm": 1.0152642726898193,
+      "learning_rate": 9.763495634020467e-05,
+      "loss": 0.07411704063415528,
+      "mean_token_accuracy": 0.9711063146591187,
+      "num_tokens": 16025297.0,
+      "step": 780
+    },
+    {
+      "entropy": 0.9120213568210602,
+      "epoch": 1.4548802946593002,
+      "grad_norm": 0.6288319826126099,
+      "learning_rate": 9.754152030119921e-05,
+      "loss": 0.07223712205886841,
+      "mean_token_accuracy": 0.9722476422786712,
+      "num_tokens": 16230656.0,
+      "step": 790
+    },
+    {
+      "entropy": 0.9142370820045471,
+      "epoch": 1.4732965009208103,
+      "grad_norm": 0.7854700088500977,
+      "learning_rate": 9.744632062685311e-05,
+      "loss": 0.07186744809150696,
+      "mean_token_accuracy": 0.972247713804245,
+      "num_tokens": 16435943.0,
+      "step": 800
+    },
+    {
+      "entropy": 0.8920814216136932,
+      "epoch": 1.4917127071823204,
+      "grad_norm": 0.6227074265480042,
+      "learning_rate": 9.734936084876383e-05,
+      "loss": 0.07016961574554444,
+      "mean_token_accuracy": 0.9725603640079499,
+      "num_tokens": 16641635.0,
+      "step": 810
+    },
+    {
+      "entropy": 0.891328877210617,
+      "epoch": 1.5101289134438307,
+      "grad_norm": 0.7601346969604492,
+      "learning_rate": 9.725064456382283e-05,
+      "loss": 0.07137494087219239,
+      "mean_token_accuracy": 0.9722997546195984,
+      "num_tokens": 16847194.0,
+      "step": 820
+    },
+    {
+      "entropy": 0.8921217978000641,
+      "epoch": 1.5285451197053406,
+      "grad_norm": 0.7813850045204163,
+      "learning_rate": 9.715017543408233e-05,
+      "loss": 0.06890199184417725,
+      "mean_token_accuracy": 0.9735044002532959,
+      "num_tokens": 17052807.0,
+      "step": 830
+    },
+    {
+      "entropy": 0.9085914671421051,
+      "epoch": 1.5469613259668509,
+      "grad_norm": 0.6184289455413818,
+      "learning_rate": 9.704795718661939e-05,
+      "loss": 0.07043765187263488,
+      "mean_token_accuracy": 0.9725716531276702,
+      "num_tokens": 17258284.0,
+      "step": 840
+    },
+    {
+      "entropy": 0.9029861629009247,
+      "epoch": 1.565377532228361,
+      "grad_norm": 0.7082377076148987,
+      "learning_rate": 9.694399361339752e-05,
+      "loss": 0.07113839387893676,
+      "mean_token_accuracy": 0.9725669205188752,
+      "num_tokens": 17464326.0,
+      "step": 850
+    },
+    {
+      "entropy": 0.8856533527374267,
+      "epoch": 1.583793738489871,
+      "grad_norm": 0.7409216165542603,
+      "learning_rate": 9.683828857112627e-05,
+      "loss": 0.07077333331108093,
+      "mean_token_accuracy": 0.9731084644794464,
+      "num_tokens": 17669537.0,
+      "step": 860
+    },
+    {
+      "entropy": 0.8613030433654785,
+      "epoch": 1.6022099447513813,
+      "grad_norm": 0.6801561713218689,
+      "learning_rate": 9.673084598111789e-05,
+      "loss": 0.06885308027267456,
+      "mean_token_accuracy": 0.97266526222229,
+      "num_tokens": 17875289.0,
+      "step": 870
+    },
+    {
+      "entropy": 0.8692965865135193,
+      "epoch": 1.6206261510128912,
+      "grad_norm": 1.1621277332305908,
+      "learning_rate": 9.662166982914203e-05,
+      "loss": 0.07017780542373657,
+      "mean_token_accuracy": 0.9733059942722321,
+      "num_tokens": 18080404.0,
+      "step": 880
+    },
+    {
+      "entropy": 0.8671502113342285,
+      "epoch": 1.6390423572744015,
+      "grad_norm": 0.7518903613090515,
+      "learning_rate": 9.651076416527787e-05,
+      "loss": 0.06977018713951111,
+      "mean_token_accuracy": 0.9730017304420471,
+      "num_tokens": 18285699.0,
+      "step": 890
+    },
+    {
+      "entropy": 0.8662045657634735,
+      "epoch": 1.6574585635359116,
+      "grad_norm": 0.6622698903083801,
+      "learning_rate": 9.639813310376378e-05,
+      "loss": 0.06620995998382569,
+      "mean_token_accuracy": 0.9737491130828857,
+      "num_tokens": 18491097.0,
+      "step": 900
+    },
+    {
+      "entropy": 0.8548173069953918,
+      "epoch": 1.6758747697974217,
+      "grad_norm": 0.8941843509674072,
+      "learning_rate": 9.628378082284479e-05,
+      "loss": 0.06711119413375854,
+      "mean_token_accuracy": 0.9740589797496796,
+      "num_tokens": 18696827.0,
+      "step": 910
+    },
+    {
+      "entropy": 0.8763562262058258,
+      "epoch": 1.694290976058932,
+      "grad_norm": 0.7571700215339661,
+      "learning_rate": 9.616771156461755e-05,
+      "loss": 0.07263468503952027,
+      "mean_token_accuracy": 0.9717419981956482,
+      "num_tokens": 18902513.0,
+      "step": 920
+    },
+    {
+      "entropy": 0.8663733780384064,
+      "epoch": 1.7127071823204418,
+      "grad_norm": 0.7886489629745483,
+      "learning_rate": 9.604992963487298e-05,
+      "loss": 0.07074605226516724,
+      "mean_token_accuracy": 0.9724965393543243,
+      "num_tokens": 19107812.0,
+      "step": 930
+    },
+    {
+      "entropy": 0.8673004627227783,
+      "epoch": 1.7311233885819521,
+      "grad_norm": 0.8180726170539856,
+      "learning_rate": 9.593043940293647e-05,
+      "loss": 0.06831735372543335,
+      "mean_token_accuracy": 0.9733696818351746,
+      "num_tokens": 19313330.0,
+      "step": 940
+    },
+    {
+      "entropy": 0.8525971233844757,
+      "epoch": 1.7495395948434622,
+      "grad_norm": 0.6576228737831116,
+      "learning_rate": 9.580924530150595e-05,
+      "loss": 0.06567002534866333,
+      "mean_token_accuracy": 0.9745754361152649,
+      "num_tokens": 19518671.0,
+      "step": 950
+    },
+    {
+      "entropy": 0.8605451703071594,
+      "epoch": 1.7679558011049723,
+      "grad_norm": 0.7171661257743835,
+      "learning_rate": 9.568635182648725e-05,
+      "loss": 0.06872050762176514,
+      "mean_token_accuracy": 0.9732091546058654,
+      "num_tokens": 19724135.0,
+      "step": 960
+    },
+    {
+      "entropy": 0.8642210960388184,
+      "epoch": 1.7863720073664826,
+      "grad_norm": 0.7603147029876709,
+      "learning_rate": 9.556176353682746e-05,
+      "loss": 0.06766576766967773,
+      "mean_token_accuracy": 0.9728681743144989,
+      "num_tokens": 19928785.0,
+      "step": 970
+    },
+    {
+      "entropy": 0.8543185651302337,
+      "epoch": 1.8047882136279927,
+      "grad_norm": 0.7280875444412231,
+      "learning_rate": 9.543548505434581e-05,
+      "loss": 0.06851862668991089,
+      "mean_token_accuracy": 0.9737437188625335,
+      "num_tokens": 20134195.0,
+      "step": 980
+    },
+    {
+      "entropy": 0.8744745373725891,
+      "epoch": 1.8232044198895028,
+      "grad_norm": 0.5897248983383179,
+      "learning_rate": 9.530752106356209e-05,
+      "loss": 0.06809053421020508,
+      "mean_token_accuracy": 0.9733593761920929,
+      "num_tokens": 20339517.0,
+      "step": 990
+    },
+    {
+      "entropy": 0.8623859465122223,
+      "epoch": 1.8416206261510129,
+      "grad_norm": 0.7515265345573425,
+      "learning_rate": 9.517787631152298e-05,
+      "loss": 0.07257847785949707,
+      "mean_token_accuracy": 0.9714054942131043,
+      "num_tokens": 20545249.0,
+      "step": 1000
+    },
+    {
+      "entropy": 0.8669404804706573,
+      "epoch": 1.860036832412523,
+      "grad_norm": 0.7144560813903809,
+      "learning_rate": 9.504655560762596e-05,
+      "loss": 0.06832354068756104,
+      "mean_token_accuracy": 0.9735779523849487,
+      "num_tokens": 20750507.0,
+      "step": 1010
+    },
+    {
+      "entropy": 0.8493516445159912,
+      "epoch": 1.8784530386740332,
+      "grad_norm": 0.6559189558029175,
+      "learning_rate": 9.491356382344081e-05,
+      "loss": 0.0629766047000885,
+      "mean_token_accuracy": 0.9754977762699127,
+      "num_tokens": 20955956.0,
+      "step": 1020
+    },
+    {
+      "entropy": 0.8599376022815705,
+      "epoch": 1.8968692449355433,
+      "grad_norm": 0.6792973279953003,
+      "learning_rate": 9.477890589252895e-05,
+      "loss": 0.0666757881641388,
+      "mean_token_accuracy": 0.974083811044693,
+      "num_tokens": 21161163.0,
+      "step": 1030
+    },
+    {
+      "entropy": 0.8458438158035279,
+      "epoch": 1.9152854511970534,
+      "grad_norm": 0.6941778659820557,
+      "learning_rate": 9.464258681026042e-05,
+      "loss": 0.06307152509689332,
+      "mean_token_accuracy": 0.9757042229175568,
+      "num_tokens": 21366525.0,
+      "step": 1040
+    },
+    {
+      "entropy": 0.848515909910202,
+      "epoch": 1.9337016574585635,
+      "grad_norm": 0.7307806611061096,
+      "learning_rate": 9.450461163362855e-05,
+      "loss": 0.06307026147842407,
+      "mean_token_accuracy": 0.9750974595546722,
+      "num_tokens": 21572238.0,
+      "step": 1050
+    },
+    {
+      "entropy": 0.8563454031944275,
+      "epoch": 1.9521178637200736,
+      "grad_norm": 0.7222106456756592,
+      "learning_rate": 9.436498548106236e-05,
+      "loss": 0.0647726058959961,
+      "mean_token_accuracy": 0.974629694223404,
+      "num_tokens": 21777633.0,
+      "step": 1060
+    },
+    {
+      "entropy": 0.8656457483768463,
+      "epoch": 1.9705340699815839,
+      "grad_norm": 0.67178875207901,
+      "learning_rate": 9.422371353223674e-05,
+      "loss": 0.06573554277420043,
+      "mean_token_accuracy": 0.9745908617973328,
+      "num_tokens": 21983116.0,
+      "step": 1070
+    },
+    {
+      "entropy": 0.8630891263484954,
+      "epoch": 1.988950276243094,
+      "grad_norm": 0.6956593990325928,
+      "learning_rate": 9.408080102788016e-05,
+      "loss": 0.06630704402923585,
+      "mean_token_accuracy": 0.9741333484649658,
+      "num_tokens": 22188662.0,
+      "step": 1080
+    },
+    {
+      "epoch": 2.0,
+      "eval_entropy": 0.8560857042022373,
+      "eval_loss": 0.06494329869747162,
+      "eval_mean_token_accuracy": 0.9745692672936813,
+      "eval_num_tokens": 22311800.0,
+      "eval_runtime": 10.129,
+      "eval_samples_per_second": 361.142,
+      "eval_steps_per_second": 11.354,
+      "step": 1086
+    },
+    {
+      "entropy": 0.8616272270679474,
+      "epoch": 2.007366482504604,
+      "grad_norm": 0.7778105139732361,
+      "learning_rate": 9.393625326958041e-05,
+      "loss": 0.054407155513763426,
+      "mean_token_accuracy": 0.9792074799537659,
+      "num_tokens": 22394215.0,
+      "step": 1090
+    },
+    {
+      "entropy": 0.8496910452842712,
+      "epoch": 2.0257826887661143,
+      "grad_norm": 0.7422528266906738,
+      "learning_rate": 9.379007561958792e-05,
+      "loss": 0.051881587505340575,
+      "mean_token_accuracy": 0.9799090325832367,
+      "num_tokens": 22599599.0,
+      "step": 1100
+    },
+    {
+      "entropy": 0.8531602442264556,
+      "epoch": 2.044198895027624,
+      "grad_norm": 0.9075332880020142,
+      "learning_rate": 9.36422735006167e-05,
+      "loss": 0.05190724730491638,
+      "mean_token_accuracy": 0.979931116104126,
+      "num_tokens": 22805318.0,
+      "step": 1110
+    },
+    {
+      "entropy": 0.8657277703285218,
+      "epoch": 2.0626151012891345,
+      "grad_norm": 0.9466913938522339,
+      "learning_rate": 9.349285239564325e-05,
+      "loss": 0.053853434324264524,
+      "mean_token_accuracy": 0.9796103596687317,
+      "num_tokens": 23010438.0,
+      "step": 1120
+    },
+    {
+      "entropy": 0.8578485429286957,
+      "epoch": 2.0810313075506444,
+      "grad_norm": 0.6903054714202881,
+      "learning_rate": 9.334181784770326e-05,
+      "loss": 0.05228850841522217,
+      "mean_token_accuracy": 0.9802409887313843,
+      "num_tokens": 23215795.0,
+      "step": 1130
+    },
+    {
+      "entropy": 0.8450767934322357,
+      "epoch": 2.0994475138121547,
+      "grad_norm": 0.6615211367607117,
+      "learning_rate": 9.318917545968581e-05,
+      "loss": 0.050570905208587646,
+      "mean_token_accuracy": 0.9802053451538086,
+      "num_tokens": 23421157.0,
+      "step": 1140
+    },
+    {
+      "entropy": 0.8325044393539429,
+      "epoch": 2.117863720073665,
+      "grad_norm": 0.760960578918457,
+      "learning_rate": 9.303493089412564e-05,
+      "loss": 0.051966112852096555,
+      "mean_token_accuracy": 0.9796205997467041,
+      "num_tokens": 23626584.0,
+      "step": 1150
+    },
+    {
+      "entropy": 0.8416404843330383,
+      "epoch": 2.136279926335175,
+      "grad_norm": 0.6947009563446045,
+      "learning_rate": 9.287908987299306e-05,
+      "loss": 0.05144861936569214,
+      "mean_token_accuracy": 0.9800034642219544,
+      "num_tokens": 23832137.0,
+      "step": 1160
+    },
+    {
+      "entropy": 0.8564540028572083,
+      "epoch": 2.154696132596685,
+      "grad_norm": 0.733252763748169,
+      "learning_rate": 9.272165817748164e-05,
+      "loss": 0.04944799542427063,
+      "mean_token_accuracy": 0.9808157980442047,
+      "num_tokens": 24038006.0,
+      "step": 1170
+    },
+    {
+      "entropy": 0.8575525343418121,
+      "epoch": 2.1731123388581954,
+      "grad_norm": 0.8911028504371643,
+      "learning_rate": 9.25626416477938e-05,
+      "loss": 0.05037952661514282,
+      "mean_token_accuracy": 0.980946284532547,
+      "num_tokens": 24243374.0,
+      "step": 1180
+    },
+    {
+      "entropy": 0.8599720418453216,
+      "epoch": 2.1915285451197053,
+      "grad_norm": 0.7713524103164673,
+      "learning_rate": 9.240204618292416e-05,
+      "loss": 0.050603735446929934,
+      "mean_token_accuracy": 0.980896121263504,
+      "num_tokens": 24448585.0,
+      "step": 1190
+    },
+    {
+      "entropy": 0.8566664934158326,
+      "epoch": 2.2099447513812156,
+      "grad_norm": 0.8439353704452515,
+      "learning_rate": 9.223987774044066e-05,
+      "loss": 0.054171699285507205,
+      "mean_token_accuracy": 0.9796543836593627,
+      "num_tokens": 24653863.0,
+      "step": 1200
+    },
+    {
+      "entropy": 0.846601277589798,
+      "epoch": 2.2283609576427255,
+      "grad_norm": 0.7025637030601501,
+      "learning_rate": 9.207614233626356e-05,
+      "loss": 0.048924127221107484,
+      "mean_token_accuracy": 0.9809681415557862,
+      "num_tokens": 24859801.0,
+      "step": 1210
+    },
+    {
+      "entropy": 0.8564423739910125,
+      "epoch": 2.2467771639042358,
+      "grad_norm": 0.7788274884223938,
+      "learning_rate": 9.191084604444233e-05,
+      "loss": 0.05260283350944519,
+      "mean_token_accuracy": 0.9793797850608825,
+      "num_tokens": 25065368.0,
+      "step": 1220
+    },
+    {
+      "entropy": 0.865056723356247,
+      "epoch": 2.265193370165746,
+      "grad_norm": 0.8728818297386169,
+      "learning_rate": 9.174399499693027e-05,
+      "loss": 0.05016371011734009,
+      "mean_token_accuracy": 0.9807134211063385,
+      "num_tokens": 25270945.0,
+      "step": 1230
+    },
+    {
+      "entropy": 0.8642262935638427,
+      "epoch": 2.283609576427256,
+      "grad_norm": 1.0582489967346191,
+      "learning_rate": 9.157559538335703e-05,
+      "loss": 0.05316779017448425,
+      "mean_token_accuracy": 0.9794209063053131,
+      "num_tokens": 25476575.0,
+      "step": 1240
+    },
+    {
+      "entropy": 0.8677761554718018,
+      "epoch": 2.3020257826887662,
+      "grad_norm": 0.760109543800354,
+      "learning_rate": 9.140565345079901e-05,
+      "loss": 0.05115479230880737,
+      "mean_token_accuracy": 0.9802310705184937,
+      "num_tokens": 25682814.0,
+      "step": 1250
+    },
+    {
+      "entropy": 0.8592945456504821,
+      "epoch": 2.320441988950276,
+      "grad_norm": 0.6537907123565674,
+      "learning_rate": 9.123417550354761e-05,
+      "loss": 0.050543540716171266,
+      "mean_token_accuracy": 0.9806945025920868,
+      "num_tokens": 25887575.0,
+      "step": 1260
+    },
+    {
+      "entropy": 0.8692500293254852,
+      "epoch": 2.3388581952117864,
+      "grad_norm": 0.7771905064582825,
+      "learning_rate": 9.106116790287541e-05,
+      "loss": 0.049718713760375975,
+      "mean_token_accuracy": 0.9805168390274048,
+      "num_tokens": 26092950.0,
+      "step": 1270
+    },
+    {
+      "entropy": 0.8841261565685272,
+      "epoch": 2.3572744014732967,
+      "grad_norm": 0.7791076898574829,
+      "learning_rate": 9.08866370668001e-05,
+      "loss": 0.0527400553226471,
+      "mean_token_accuracy": 0.9796754539012908,
+      "num_tokens": 26298182.0,
+      "step": 1280
+    },
+    {
+      "entropy": 0.8675022900104523,
+      "epoch": 2.3756906077348066,
+      "grad_norm": 0.8481605648994446,
+      "learning_rate": 9.07105894698464e-05,
+      "loss": 0.05320838689804077,
+      "mean_token_accuracy": 0.9792274832725525,
+      "num_tokens": 26503425.0,
+      "step": 1290
+    },
+    {
+      "entropy": 0.8704026222229004,
+      "epoch": 2.394106813996317,
+      "grad_norm": 0.8235505819320679,
+      "learning_rate": 9.053303164280602e-05,
+      "loss": 0.055045205354690555,
+      "mean_token_accuracy": 0.9788750648498535,
+      "num_tokens": 26708755.0,
+      "step": 1300
+    },
+    {
+      "entropy": 0.8525134027004242,
+      "epoch": 2.4125230202578267,
+      "grad_norm": 0.7611598968505859,
+      "learning_rate": 9.035397017249518e-05,
+      "loss": 0.05029621124267578,
+      "mean_token_accuracy": 0.9802757322788238,
+      "num_tokens": 26914704.0,
+      "step": 1310
+    },
+    {
+      "entropy": 0.8630305290222168,
+      "epoch": 2.430939226519337,
+      "grad_norm": 0.790408194065094,
+      "learning_rate": 9.017341170151041e-05,
+      "loss": 0.04856040775775909,
+      "mean_token_accuracy": 0.9809690833091735,
+      "num_tokens": 27120151.0,
+      "step": 1320
+    },
+    {
+      "entropy": 0.8579159140586853,
+      "epoch": 2.4493554327808473,
+      "grad_norm": 0.781972348690033,
+      "learning_rate": 8.999136292798207e-05,
+      "loss": 0.04869682788848877,
+      "mean_token_accuracy": 0.9816130697727203,
+      "num_tokens": 27325673.0,
+      "step": 1330
+    },
+    {
+      "entropy": 0.8634716987609863,
+      "epoch": 2.467771639042357,
+      "grad_norm": 0.8500784039497375,
+      "learning_rate": 8.980783060532588e-05,
+      "loss": 0.05050289034843445,
+      "mean_token_accuracy": 0.980079609155655,
+      "num_tokens": 27531270.0,
+      "step": 1340
+    },
+    {
+      "entropy": 0.8660618126392364,
+      "epoch": 2.4861878453038675,
+      "grad_norm": 0.719760537147522,
+      "learning_rate": 8.96228215419924e-05,
+      "loss": 0.04892141819000244,
+      "mean_token_accuracy": 0.9814020991325378,
+      "num_tokens": 27736542.0,
+      "step": 1350
+    },
+    {
+      "entropy": 0.8572284400463104,
+      "epoch": 2.5046040515653774,
+      "grad_norm": 1.0197229385375977,
+      "learning_rate": 8.943634260121442e-05,
+      "loss": 0.05104702711105347,
+      "mean_token_accuracy": 0.9798846662044525,
+      "num_tokens": 27941566.0,
+      "step": 1360
+    },
+    {
+      "entropy": 0.8702241241931915,
+      "epoch": 2.5230202578268877,
+      "grad_norm": 0.7136003375053406,
+      "learning_rate": 8.924840070075247e-05,
+      "loss": 0.04855787754058838,
+      "mean_token_accuracy": 0.9811685383319855,
+      "num_tokens": 28146943.0,
+      "step": 1370
+    },
+    {
+      "entropy": 0.874957013130188,
+      "epoch": 2.541436464088398,
+      "grad_norm": 0.8775497674942017,
+      "learning_rate": 8.905900281263804e-05,
+      "loss": 0.052434295415878296,
+      "mean_token_accuracy": 0.9795438170433044,
+      "num_tokens": 28352640.0,
+      "step": 1380
+    },
+    {
+      "entropy": 0.8776536166667939,
+      "epoch": 2.559852670349908,
+      "grad_norm": 0.8895741105079651,
+      "learning_rate": 8.8868155962915e-05,
+      "loss": 0.05282890796661377,
+      "mean_token_accuracy": 0.9790538609027862,
+      "num_tokens": 28558153.0,
+      "step": 1390
+    },
+    {
+      "entropy": 0.8738743245601654,
+      "epoch": 2.578268876611418,
+      "grad_norm": 0.788800060749054,
+      "learning_rate": 8.867586723137906e-05,
+      "loss": 0.048841872811317445,
+      "mean_token_accuracy": 0.9809149026870727,
+      "num_tokens": 28763613.0,
+      "step": 1400
+    },
+    {
+      "entropy": 0.8750253796577454,
+      "epoch": 2.596685082872928,
+      "grad_norm": 0.8738002777099609,
+      "learning_rate": 8.848214375131497e-05,
+      "loss": 0.048261132836341855,
+      "mean_token_accuracy": 0.980789190530777,
+      "num_tokens": 28969248.0,
+      "step": 1410
+    },
+    {
+      "entropy": 0.8624245524406433,
+      "epoch": 2.6151012891344383,
+      "grad_norm": 0.6404895186424255,
+      "learning_rate": 8.828699270923196e-05,
+      "loss": 0.04970468282699585,
+      "mean_token_accuracy": 0.9807762265205383,
+      "num_tokens": 29174779.0,
+      "step": 1420
+    },
+    {
+      "entropy": 0.8792938470840455,
+      "epoch": 2.6335174953959486,
+      "grad_norm": 0.7856965661048889,
+      "learning_rate": 8.80904213445972e-05,
+      "loss": 0.053334391117095946,
+      "mean_token_accuracy": 0.9790222108364105,
+      "num_tokens": 29380474.0,
+      "step": 1430
+    },
+    {
+      "entropy": 0.8831034600734711,
+      "epoch": 2.6519337016574585,
+      "grad_norm": 0.7739618420600891,
+      "learning_rate": 8.789243694956716e-05,
+      "loss": 0.04959054589271546,
+      "mean_token_accuracy": 0.9803965091705322,
+      "num_tokens": 29585985.0,
+      "step": 1440
+    },
+    {
+      "entropy": 0.8934672951698304,
+      "epoch": 2.6703499079189688,
+      "grad_norm": 0.6999697089195251,
+      "learning_rate": 8.769304686871719e-05,
+      "loss": 0.05165250301361084,
+      "mean_token_accuracy": 0.9798884153366089,
+      "num_tokens": 29791238.0,
+      "step": 1450
+    },
+    {
+      "entropy": 0.9053199410438537,
+      "epoch": 2.6887661141804786,
+      "grad_norm": 0.9199564456939697,
+      "learning_rate": 8.749225849876892e-05,
+      "loss": 0.04924143850803375,
+      "mean_token_accuracy": 0.9810785710811615,
+      "num_tokens": 29996589.0,
+      "step": 1460
+    },
+    {
+      "entropy": 0.888091403245926,
+      "epoch": 2.707182320441989,
+      "grad_norm": 0.7480106353759766,
+      "learning_rate": 8.729007928831597e-05,
+      "loss": 0.04948916733264923,
+      "mean_token_accuracy": 0.9809579730033875,
+      "num_tokens": 30201875.0,
+      "step": 1470
+    },
+    {
+      "entropy": 0.8723407983779907,
+      "epoch": 2.7255985267034992,
+      "grad_norm": 0.9506945013999939,
+      "learning_rate": 8.708651673754763e-05,
+      "loss": 0.048927539587020875,
+      "mean_token_accuracy": 0.980553150177002,
+      "num_tokens": 30407550.0,
+      "step": 1480
+    },
+    {
+      "entropy": 0.8737521529197693,
+      "epoch": 2.744014732965009,
+      "grad_norm": 0.8015706539154053,
+      "learning_rate": 8.688157839797062e-05,
+      "loss": 0.04963063597679138,
+      "mean_token_accuracy": 0.9809738755226135,
+      "num_tokens": 30612839.0,
+      "step": 1490
+    },
+    {
+      "entropy": 0.8800762951374054,
+      "epoch": 2.7624309392265194,
+      "grad_norm": 0.9429986476898193,
+      "learning_rate": 8.667527187212885e-05,
+      "loss": 0.0524174690246582,
+      "mean_token_accuracy": 0.9788767337799072,
+      "num_tokens": 30818578.0,
+      "step": 1500
+    },
+    {
+      "entropy": 0.8871055901050567,
+      "epoch": 2.7808471454880292,
+      "grad_norm": 0.5909196138381958,
+      "learning_rate": 8.646760481332157e-05,
+      "loss": 0.05166680812835693,
+      "mean_token_accuracy": 0.980216771364212,
+      "num_tokens": 31023829.0,
+      "step": 1510
+    },
+    {
+      "entropy": 0.8908755779266357,
+      "epoch": 2.7992633517495396,
+      "grad_norm": 0.9154611229896545,
+      "learning_rate": 8.625858492531931e-05,
+      "loss": 0.04951836466789246,
+      "mean_token_accuracy": 0.9801484227180481,
+      "num_tokens": 31229635.0,
+      "step": 1520
+    },
+    {
+      "entropy": 0.92480548620224,
+      "epoch": 2.81767955801105,
+      "grad_norm": 0.5989938378334045,
+      "learning_rate": 8.604821996207819e-05,
+      "loss": 0.04799881279468536,
+      "mean_token_accuracy": 0.9817522585391998,
+      "num_tokens": 31435456.0,
+      "step": 1530
+    },
+    {
+      "entropy": 0.9173881888389588,
+      "epoch": 2.8360957642725597,
+      "grad_norm": 0.899413526058197,
+      "learning_rate": 8.58365177274522e-05,
+      "loss": 0.0487445592880249,
+      "mean_token_accuracy": 0.9812625288963318,
+      "num_tokens": 31640904.0,
+      "step": 1540
+    },
+    {
+      "entropy": 0.9076135993003845,
+      "epoch": 2.85451197053407,
+      "grad_norm": 0.8494166135787964,
+      "learning_rate": 8.562348607490376e-05,
+      "loss": 0.05005228519439697,
+      "mean_token_accuracy": 0.9806681036949157,
+      "num_tokens": 31845807.0,
+      "step": 1550
+    },
+    {
+      "entropy": 0.9092245221138,
+      "epoch": 2.87292817679558,
+      "grad_norm": 0.8225123286247253,
+      "learning_rate": 8.540913290721234e-05,
+      "loss": 0.048654764890670776,
+      "mean_token_accuracy": 0.9805659353733063,
+      "num_tokens": 32051523.0,
+      "step": 1560
+    },
+    {
+      "entropy": 0.9062779664993286,
+      "epoch": 2.89134438305709,
+      "grad_norm": 0.7074014544487,
+      "learning_rate": 8.519346617618134e-05,
+      "loss": 0.049209845066070554,
+      "mean_token_accuracy": 0.9807434439659118,
+      "num_tokens": 32256895.0,
+      "step": 1570
+    },
+    {
+      "entropy": 0.9190246641635895,
+      "epoch": 2.9097605893186005,
+      "grad_norm": 0.8860642910003662,
+      "learning_rate": 8.497649388234304e-05,
+      "loss": 0.051211881637573245,
+      "mean_token_accuracy": 0.9802342295646668,
+      "num_tokens": 32462031.0,
+      "step": 1580
+    },
+    {
+      "entropy": 0.9088015079498291,
+      "epoch": 2.9281767955801103,
+      "grad_norm": 0.8062726855278015,
+      "learning_rate": 8.475822407466188e-05,
+      "loss": 0.053512704372406,
+      "mean_token_accuracy": 0.979486483335495,
+      "num_tokens": 32667533.0,
+      "step": 1590
+    },
+    {
+      "entropy": 0.9462027847766876,
+      "epoch": 2.9465930018416207,
+      "grad_norm": 0.7962909936904907,
+      "learning_rate": 8.453866485023579e-05,
+      "loss": 0.0501457154750824,
+      "mean_token_accuracy": 0.9803222417831421,
+      "num_tokens": 32872900.0,
+      "step": 1600
+    },
+    {
+      "entropy": 0.9671471297740937,
+      "epoch": 2.9650092081031305,
+      "grad_norm": 0.7641744017601013,
+      "learning_rate": 8.431782435399587e-05,
+      "loss": 0.04629061222076416,
+      "mean_token_accuracy": 0.9823175370693207,
+      "num_tokens": 33077850.0,
+      "step": 1610
+    },
+    {
+      "entropy": 0.955865204334259,
+      "epoch": 2.983425414364641,
+      "grad_norm": 0.6772348880767822,
+      "learning_rate": 8.409571077840426e-05,
+      "loss": 0.048368623852729796,
+      "mean_token_accuracy": 0.9808700799942016,
+      "num_tokens": 33283117.0,
+      "step": 1620
+    },
+    {
+      "epoch": 3.0,
+      "eval_entropy": 0.9563225186389426,
+      "eval_loss": 0.059064481407403946,
+      "eval_mean_token_accuracy": 0.9773589429648026,
+      "eval_num_tokens": 33467712.0,
+      "eval_runtime": 10.1471,
+      "eval_samples_per_second": 360.499,
+      "eval_steps_per_second": 11.333,
+      "step": 1629
+    },
+    {
+      "entropy": 0.9337226033210755,
+      "epoch": 3.001841620626151,
+      "grad_norm": 0.646203875541687,
+      "learning_rate": 8.387233236315016e-05,
+      "loss": 0.043352216482162476,
+      "mean_token_accuracy": 0.9830620110034942,
+      "num_tokens": 33488302.0,
+      "step": 1630
+    },
+    {
+      "entropy": 0.9734923839569092,
+      "epoch": 3.020257826887661,
+      "grad_norm": 0.7564226984977722,
+      "learning_rate": 8.364769739484416e-05,
+      "loss": 0.033932483196258544,
+      "mean_token_accuracy": 0.9872806966304779,
+      "num_tokens": 33693531.0,
+      "step": 1640
+    },
+    {
+      "entropy": 0.9669206500053406,
+      "epoch": 3.0386740331491713,
+      "grad_norm": 0.7126886248588562,
+      "learning_rate": 8.342181420671096e-05,
+      "loss": 0.03818287253379822,
+      "mean_token_accuracy": 0.9852082908153534,
+      "num_tokens": 33899305.0,
+      "step": 1650
+    },
+    {
+      "entropy": 0.9522916138172149,
+      "epoch": 3.0570902394106816,
+      "grad_norm": 1.0571653842926025,
+      "learning_rate": 8.319469117828007e-05,
+      "loss": 0.03456039130687714,
+      "mean_token_accuracy": 0.9867027878761292,
+      "num_tokens": 34104585.0,
+      "step": 1660
+    },
+    {
+      "entropy": 0.9568560004234314,
+      "epoch": 3.0755064456721914,
+      "grad_norm": 0.780940592288971,
+      "learning_rate": 8.296633673507505e-05,
+      "loss": 0.03551802039146423,
+      "mean_token_accuracy": 0.9867531359195709,
+      "num_tokens": 34309516.0,
+      "step": 1670
+    },
+    {
+      "entropy": 0.9590656876564025,
+      "epoch": 3.0939226519337018,
+      "grad_norm": 0.8330219388008118,
+      "learning_rate": 8.273675934830094e-05,
+      "loss": 0.03674865961074829,
+      "mean_token_accuracy": 0.9864118576049805,
+      "num_tokens": 34515170.0,
+      "step": 1680
+    },
+    {
+      "entropy": 0.975881814956665,
+      "epoch": 3.1123388581952116,
+      "grad_norm": 0.7010637521743774,
+      "learning_rate": 8.250596753453e-05,
+      "loss": 0.03550414443016052,
+      "mean_token_accuracy": 0.9864102602005005,
+      "num_tokens": 34720896.0,
+      "step": 1690
+    },
+    {
+      "entropy": 0.9599562883377075,
+      "epoch": 3.130755064456722,
+      "grad_norm": 0.6694278717041016,
+      "learning_rate": 8.227396985538578e-05,
+      "loss": 0.035564273595809937,
+      "mean_token_accuracy": 0.9867321848869324,
+      "num_tokens": 34925970.0,
+      "step": 1700
+    },
+    {
+      "entropy": 0.9582216143608093,
+      "epoch": 3.149171270718232,
+      "grad_norm": 0.9333199262619019,
+      "learning_rate": 8.204077491722546e-05,
+      "loss": 0.035575729608535764,
+      "mean_token_accuracy": 0.9862452208995819,
+      "num_tokens": 35131543.0,
+      "step": 1710
+    },
+    {
+      "entropy": 0.9579678058624268,
+      "epoch": 3.167587476979742,
+      "grad_norm": 0.9450218081474304,
+      "learning_rate": 8.180639137082066e-05,
+      "loss": 0.0385298490524292,
+      "mean_token_accuracy": 0.98538036942482,
+      "num_tokens": 35336790.0,
+      "step": 1720
+    },
+    {
+      "entropy": 0.9640831351280212,
+      "epoch": 3.1860036832412524,
+      "grad_norm": 0.8551534414291382,
+      "learning_rate": 8.157082791103649e-05,
+      "loss": 0.03702138364315033,
+      "mean_token_accuracy": 0.9852015495300293,
+      "num_tokens": 35542294.0,
+      "step": 1730
+    },
+    {
+      "entropy": 0.9867071211338043,
+      "epoch": 3.2044198895027622,
+      "grad_norm": 0.7138128876686096,
+      "learning_rate": 8.133409327650897e-05,
+      "loss": 0.035626694560050964,
+      "mean_token_accuracy": 0.986064875125885,
+      "num_tokens": 35747447.0,
+      "step": 1740
+    },
+    {
+      "entropy": 0.9639089345932007,
+      "epoch": 3.2228360957642725,
+      "grad_norm": 0.7131415009498596,
+      "learning_rate": 8.109619624932092e-05,
+      "loss": 0.035885071754455565,
+      "mean_token_accuracy": 0.986273056268692,
+      "num_tokens": 35952258.0,
+      "step": 1750
+    },
+    {
+      "entropy": 0.9516046345233917,
+      "epoch": 3.241252302025783,
+      "grad_norm": 0.6900200843811035,
+      "learning_rate": 8.085714565467611e-05,
+      "loss": 0.03535219430923462,
+      "mean_token_accuracy": 0.985836285352707,
+      "num_tokens": 36157938.0,
+      "step": 1760
+    },
+    {
+      "entropy": 0.9373646557331086,
+      "epoch": 3.2596685082872927,
+      "grad_norm": 0.6101690530776978,
+      "learning_rate": 8.061695036057191e-05,
+      "loss": 0.034940996766090394,
+      "mean_token_accuracy": 0.9863743901252746,
+      "num_tokens": 36363825.0,
+      "step": 1770
+    },
+    {
+      "entropy": 0.9444344758987426,
+      "epoch": 3.278084714548803,
+      "grad_norm": 0.7518529295921326,
+      "learning_rate": 8.03756192774703e-05,
+      "loss": 0.03404279053211212,
+      "mean_token_accuracy": 0.9866396844387054,
+      "num_tokens": 36568961.0,
+      "step": 1780
+    },
+    {
+      "entropy": 0.9550357758998871,
+      "epoch": 3.2965009208103133,
+      "grad_norm": 0.7687555551528931,
+      "learning_rate": 8.013316135796734e-05,
+      "loss": 0.038447052240371704,
+      "mean_token_accuracy": 0.985325163602829,
+      "num_tokens": 36774514.0,
+      "step": 1790
+    },
+    {
+      "entropy": 0.9477231681346894,
+      "epoch": 3.314917127071823,
+      "grad_norm": 0.7521633505821228,
+      "learning_rate": 7.988958559646102e-05,
+      "loss": 0.03746694028377533,
+      "mean_token_accuracy": 0.9853165090084076,
+      "num_tokens": 36979660.0,
+      "step": 1800
+    },
+    {
+      "entropy": 0.925805002450943,
+      "epoch": 3.3333333333333335,
+      "grad_norm": 0.9333297610282898,
+      "learning_rate": 7.964490102881768e-05,
+      "loss": 0.03700103759765625,
+      "mean_token_accuracy": 0.9850880861282348,
+      "num_tokens": 37185191.0,
+      "step": 1810
+    },
+    {
+      "entropy": 0.9225482225418091,
+      "epoch": 3.3517495395948433,
+      "grad_norm": 0.7928622961044312,
+      "learning_rate": 7.939911673203665e-05,
+      "loss": 0.03825801610946655,
+      "mean_token_accuracy": 0.9850241422653199,
+      "num_tokens": 37390749.0,
+      "step": 1820
+    },
+    {
+      "entropy": 0.9597147881984711,
+      "epoch": 3.3701657458563536,
+      "grad_norm": 0.7658583521842957,
+      "learning_rate": 7.915224182391375e-05,
+      "loss": 0.039855146408081056,
+      "mean_token_accuracy": 0.9845879554748536,
+      "num_tokens": 37596052.0,
+      "step": 1830
+    },
+    {
+      "entropy": 0.9485619068145752,
+      "epoch": 3.388581952117864,
+      "grad_norm": 0.8492130637168884,
+      "learning_rate": 7.890428546270278e-05,
+      "loss": 0.039359599351882935,
+      "mean_token_accuracy": 0.9847265422344208,
+      "num_tokens": 37802063.0,
+      "step": 1840
+    },
+    {
+      "entropy": 0.9670301914215088,
+      "epoch": 3.406998158379374,
+      "grad_norm": 0.7527599930763245,
+      "learning_rate": 7.865525684677608e-05,
+      "loss": 0.03752985596656799,
+      "mean_token_accuracy": 0.9855137526988983,
+      "num_tokens": 38007432.0,
+      "step": 1850
+    },
+    {
+      "entropy": 0.9681244969367981,
+      "epoch": 3.425414364640884,
+      "grad_norm": 0.7599612474441528,
+      "learning_rate": 7.840516521428303e-05,
+      "loss": 0.03653894364833832,
+      "mean_token_accuracy": 0.9858933389186859,
+      "num_tokens": 38212923.0,
+      "step": 1860
+    },
+    {
+      "entropy": 0.9706049561500549,
+      "epoch": 3.443830570902394,
+      "grad_norm": 0.7678127884864807,
+      "learning_rate": 7.815401984280748e-05,
+      "loss": 0.0366938978433609,
+      "mean_token_accuracy": 0.9854713797569274,
+      "num_tokens": 38418422.0,
+      "step": 1870
+    },
+    {
+      "entropy": 0.9637093842029572,
+      "epoch": 3.4622467771639043,
+      "grad_norm": 0.762824535369873,
+      "learning_rate": 7.790183004902359e-05,
+      "loss": 0.03516915142536163,
+      "mean_token_accuracy": 0.9866003453731537,
+      "num_tokens": 38624389.0,
+      "step": 1880
+    },
+    {
+      "entropy": 0.9373565018177032,
+      "epoch": 3.4806629834254146,
+      "grad_norm": 0.8221780061721802,
+      "learning_rate": 7.764860518835014e-05,
+      "loss": 0.04049026966094971,
+      "mean_token_accuracy": 0.984089481830597,
+      "num_tokens": 38829654.0,
+      "step": 1890
+    },
+    {
+      "entropy": 0.9356025457382202,
+      "epoch": 3.4990791896869244,
+      "grad_norm": 0.7583426237106323,
+      "learning_rate": 7.739435465460356e-05,
+      "loss": 0.03658481240272522,
+      "mean_token_accuracy": 0.9857318818569183,
+      "num_tokens": 39034638.0,
+      "step": 1900
+    },
+    {
+      "entropy": 0.9740163326263428,
+      "epoch": 3.5174953959484347,
+      "grad_norm": 0.7332878112792969,
+      "learning_rate": 7.713908787964937e-05,
+      "loss": 0.03508963882923126,
+      "mean_token_accuracy": 0.9863419532775879,
+      "num_tokens": 39240265.0,
+      "step": 1910
+    },
+    {
+      "entropy": 0.9528286933898926,
+      "epoch": 3.5359116022099446,
+      "grad_norm": 0.6515451669692993,
+      "learning_rate": 7.688281433305233e-05,
+      "loss": 0.036055779457092284,
+      "mean_token_accuracy": 0.9860979080200195,
+      "num_tokens": 39445546.0,
+      "step": 1920
+    },
+    {
+      "entropy": 0.9480705261230469,
+      "epoch": 3.554327808471455,
+      "grad_norm": 0.7725827097892761,
+      "learning_rate": 7.662554352172515e-05,
+      "loss": 0.037101513147354125,
+      "mean_token_accuracy": 0.985782790184021,
+      "num_tokens": 39651078.0,
+      "step": 1930
+    },
+    {
+      "entropy": 0.9655321061611175,
+      "epoch": 3.572744014732965,
+      "grad_norm": 0.7756506204605103,
+      "learning_rate": 7.636728498957581e-05,
+      "loss": 0.03721855878829956,
+      "mean_token_accuracy": 0.9857951939105988,
+      "num_tokens": 39856542.0,
+      "step": 1940
+    },
+    {
+      "entropy": 0.9772682309150695,
+      "epoch": 3.591160220994475,
+      "grad_norm": 0.9084987640380859,
+      "learning_rate": 7.610804831715355e-05,
+      "loss": 0.03570749163627625,
+      "mean_token_accuracy": 0.9863450109958649,
+      "num_tokens": 40061913.0,
+      "step": 1950
+    },
+    {
+      "entropy": 0.9579685389995575,
+      "epoch": 3.6095764272559854,
+      "grad_norm": 0.6358487606048584,
+      "learning_rate": 7.584784312129334e-05,
+      "loss": 0.038210684061050416,
+      "mean_token_accuracy": 0.9850837290287018,
+      "num_tokens": 40267398.0,
+      "step": 1960
+    },
+    {
+      "entropy": 0.9605201721191406,
+      "epoch": 3.6279926335174952,
+      "grad_norm": 0.6263149976730347,
+      "learning_rate": 7.558667905475927e-05,
+      "loss": 0.03509160876274109,
+      "mean_token_accuracy": 0.9868143379688263,
+      "num_tokens": 40472827.0,
+      "step": 1970
+    },
+    {
+      "entropy": 0.964026153087616,
+      "epoch": 3.6464088397790055,
+      "grad_norm": 0.90068119764328,
+      "learning_rate": 7.532456580588638e-05,
+      "loss": 0.036211782693862916,
+      "mean_token_accuracy": 0.9858468770980835,
+      "num_tokens": 40677935.0,
+      "step": 1980
+    },
+    {
+      "entropy": 0.9494135618209839,
+      "epoch": 3.664825046040516,
+      "grad_norm": 0.760134756565094,
+      "learning_rate": 7.50615130982213e-05,
+      "loss": 0.03786201477050781,
+      "mean_token_accuracy": 0.9852500438690186,
+      "num_tokens": 40883750.0,
+      "step": 1990
+    },
+    {
+      "entropy": 0.9527071297168732,
+      "epoch": 3.6832412523020257,
+      "grad_norm": 0.9812107682228088,
+      "learning_rate": 7.479753069016152e-05,
+      "loss": 0.03803159594535828,
+      "mean_token_accuracy": 0.9852405369281769,
+      "num_tokens": 41089115.0,
+      "step": 2000
+    },
+    {
+      "entropy": 0.9639330863952636,
+      "epoch": 3.701657458563536,
+      "grad_norm": 0.7164933681488037,
+      "learning_rate": 7.453262837459332e-05,
+      "loss": 0.03912568986415863,
+      "mean_token_accuracy": 0.9849458575248718,
+      "num_tokens": 41294694.0,
+      "step": 2010
+    },
+    {
+      "entropy": 0.9536987483501435,
+      "epoch": 3.720073664825046,
+      "grad_norm": 0.6804596185684204,
+      "learning_rate": 7.426681597852863e-05,
+      "loss": 0.036410006880760196,
+      "mean_token_accuracy": 0.985712206363678,
+      "num_tokens": 41499817.0,
+      "step": 2020
+    },
+    {
+      "entropy": 0.9478164672851562,
+      "epoch": 3.738489871086556,
+      "grad_norm": 0.8799397349357605,
+      "learning_rate": 7.400010336274037e-05,
+      "loss": 0.03801035583019256,
+      "mean_token_accuracy": 0.9850274682044983,
+      "num_tokens": 41704932.0,
+      "step": 2030
+    },
+    {
+      "entropy": 0.9383447647094727,
+      "epoch": 3.7569060773480665,
+      "grad_norm": 0.8386216163635254,
+      "learning_rate": 7.373250042139664e-05,
+      "loss": 0.0373637855052948,
+      "mean_token_accuracy": 0.9854822158813477,
+      "num_tokens": 41910804.0,
+      "step": 2040
+    },
+    {
+      "entropy": 0.925172996520996,
+      "epoch": 3.7753222836095763,
+      "grad_norm": 0.7599324584007263,
+      "learning_rate": 7.346401708169377e-05,
+      "loss": 0.03585260808467865,
+      "mean_token_accuracy": 0.9860672950744629,
+      "num_tokens": 42116706.0,
+      "step": 2050
+    },
+    {
+      "entropy": 0.9463765442371368,
+      "epoch": 3.7937384898710866,
+      "grad_norm": 0.9030149579048157,
+      "learning_rate": 7.319466330348797e-05,
+      "loss": 0.035877206921577455,
+      "mean_token_accuracy": 0.9863968968391419,
+      "num_tokens": 42322670.0,
+      "step": 2060
+    },
+    {
+      "entropy": 0.9942441761493683,
+      "epoch": 3.8121546961325965,
+      "grad_norm": 0.6400449275970459,
+      "learning_rate": 7.292444907892587e-05,
+      "loss": 0.037310433387756345,
+      "mean_token_accuracy": 0.9854151606559753,
+      "num_tokens": 42527752.0,
+      "step": 2070
+    },
+    {
+      "entropy": 0.9577703952789307,
+      "epoch": 3.830570902394107,
+      "grad_norm": 0.6193167567253113,
+      "learning_rate": 7.265338443207387e-05,
+      "loss": 0.03648848831653595,
+      "mean_token_accuracy": 0.9856530070304871,
+      "num_tokens": 42732981.0,
+      "step": 2080
+    },
+    {
+      "entropy": 0.9663952767848969,
+      "epoch": 3.848987108655617,
+      "grad_norm": 0.759611189365387,
+      "learning_rate": 7.238147941854625e-05,
+      "loss": 0.036112996935844424,
+      "mean_token_accuracy": 0.9862765550613404,
+      "num_tokens": 42938619.0,
+      "step": 2090
+    },
+    {
+      "entropy": 0.9484863519668579,
+      "epoch": 3.867403314917127,
+      "grad_norm": 0.7420705556869507,
+      "learning_rate": 7.210874412513218e-05,
+      "loss": 0.03703283965587616,
+      "mean_token_accuracy": 0.9857317566871643,
+      "num_tokens": 43143753.0,
+      "step": 2100
+    },
+    {
+      "entropy": 0.964326673746109,
+      "epoch": 3.8858195211786373,
+      "grad_norm": 0.8779639601707458,
+      "learning_rate": 7.183518866942147e-05,
+      "loss": 0.03739701807498932,
+      "mean_token_accuracy": 0.9852154791355133,
+      "num_tokens": 43349451.0,
+      "step": 2110
+    },
+    {
+      "entropy": 0.9729791641235351,
+      "epoch": 3.904235727440147,
+      "grad_norm": 0.7582741379737854,
+      "learning_rate": 7.156082319942929e-05,
+      "loss": 0.03894525766372681,
+      "mean_token_accuracy": 0.9847454309463501,
+      "num_tokens": 43554598.0,
+      "step": 2120
+    },
+    {
+      "entropy": 0.9860592544078827,
+      "epoch": 3.9226519337016574,
+      "grad_norm": 0.860698938369751,
+      "learning_rate": 7.128565789321969e-05,
+      "loss": 0.0365300178527832,
+      "mean_token_accuracy": 0.9859121859073638,
+      "num_tokens": 43760081.0,
+      "step": 2130
+    },
+    {
+      "entropy": 0.9916551172733307,
+      "epoch": 3.9410681399631677,
+      "grad_norm": 0.8363776206970215,
+      "learning_rate": 7.100970295852805e-05,
+      "loss": 0.036221379041671754,
+      "mean_token_accuracy": 0.9859034180641174,
+      "num_tokens": 43965432.0,
+      "step": 2140
+    },
+    {
+      "entropy": 0.9553558886051178,
+      "epoch": 3.9594843462246776,
+      "grad_norm": 0.9627474546432495,
+      "learning_rate": 7.073296863238242e-05,
+      "loss": 0.03684481382369995,
+      "mean_token_accuracy": 0.9857315957546234,
+      "num_tokens": 44171232.0,
+      "step": 2150
+    },
+    {
+      "entropy": 0.9538035809993743,
+      "epoch": 3.977900552486188,
+      "grad_norm": 0.8399474620819092,
+      "learning_rate": 7.045546518072366e-05,
+      "loss": 0.03825397789478302,
+      "mean_token_accuracy": 0.9846831560134888,
+      "num_tokens": 44376723.0,
+      "step": 2160
+    },
+    {
+      "entropy": 0.9476235210895538,
+      "epoch": 3.9963167587476978,
+      "grad_norm": 0.708739697933197,
+      "learning_rate": 7.017720289802472e-05,
+      "loss": 0.03618018329143524,
+      "mean_token_accuracy": 0.9861325800418854,
+      "num_tokens": 44582407.0,
+      "step": 2170
+    },
+    {
+      "epoch": 4.0,
+      "eval_entropy": 0.9569619194321011,
+      "eval_loss": 0.059838198125362396,
+      "eval_mean_token_accuracy": 0.9777795366618944,
+      "eval_num_tokens": 44623647.0,
+      "eval_runtime": 10.0379,
+      "eval_samples_per_second": 364.42,
+      "eval_steps_per_second": 11.457,
+      "step": 2172
+    },
+    {
+      "entropy": 0.9558675646781921,
+      "epoch": 4.014732965009208,
+      "grad_norm": 0.7347508668899536,
+      "learning_rate": 6.989819210690872e-05,
+      "loss": 0.02886659502983093,
+      "mean_token_accuracy": 0.9892994821071625,
+      "num_tokens": 44788219.0,
+      "step": 2180
+    },
+    {
+      "entropy": 1.0037677466869355,
+      "epoch": 4.033149171270718,
+      "grad_norm": 0.7403206825256348,
+      "learning_rate": 6.961844315776596e-05,
+      "loss": 0.02395295798778534,
+      "mean_token_accuracy": 0.9906026899814606,
+      "num_tokens": 44993505.0,
+      "step": 2190
+    },
+    {
+      "entropy": 1.0068290829658508,
+      "epoch": 4.051565377532229,
+      "grad_norm": 0.7979726195335388,
+      "learning_rate": 6.933796642837003e-05,
+      "loss": 0.02605988085269928,
+      "mean_token_accuracy": 0.9899706900119781,
+      "num_tokens": 45199193.0,
+      "step": 2200
+    },
+    {
+      "entropy": 0.9942211747169495,
+      "epoch": 4.069981583793738,
+      "grad_norm": 0.6460402011871338,
+      "learning_rate": 6.905677232349278e-05,
+      "loss": 0.025350230932235717,
+      "mean_token_accuracy": 0.9899386286735534,
+      "num_tokens": 45404030.0,
+      "step": 2210
+    },
+    {
+      "entropy": 0.9783595442771912,
+      "epoch": 4.088397790055248,
+      "grad_norm": 0.8177055716514587,
+      "learning_rate": 6.877487127451834e-05,
+      "loss": 0.02696993052959442,
+      "mean_token_accuracy": 0.9896106541156768,
+      "num_tokens": 45609763.0,
+      "step": 2220
+    },
+    {
+      "entropy": 0.9801763832569123,
+      "epoch": 4.106813996316759,
+      "grad_norm": 0.6608165502548218,
+      "learning_rate": 6.849227373905618e-05,
+      "loss": 0.025101393461227417,
+      "mean_token_accuracy": 0.9904372334480286,
+      "num_tokens": 45814941.0,
+      "step": 2230
+    },
+    {
+      "entropy": 0.9695689737796783,
+      "epoch": 4.125230202578269,
+      "grad_norm": 0.8036547899246216,
+      "learning_rate": 6.820899020055314e-05,
+      "loss": 0.027827343344688414,
+      "mean_token_accuracy": 0.9890337705612182,
+      "num_tokens": 46020535.0,
+      "step": 2240
+    },
+    {
+      "entropy": 0.9828635334968567,
+      "epoch": 4.143646408839779,
+      "grad_norm": 0.7729921936988831,
+      "learning_rate": 6.792503116790455e-05,
+      "loss": 0.02779492735862732,
+      "mean_token_accuracy": 0.9894372522830963,
+      "num_tokens": 46226013.0,
+      "step": 2250
+    },
+    {
+      "entropy": 0.9978842556476593,
+      "epoch": 4.162062615101289,
+      "grad_norm": 0.7334664463996887,
+      "learning_rate": 6.764040717506432e-05,
+      "loss": 0.025673511624336242,
+      "mean_token_accuracy": 0.9899355113506317,
+      "num_tokens": 46432087.0,
+      "step": 2260
+    },
+    {
+      "entropy": 1.0116403937339782,
+      "epoch": 4.180478821362799,
+      "grad_norm": 0.6769368052482605,
+      "learning_rate": 6.735512878065427e-05,
+      "loss": 0.024705511331558228,
+      "mean_token_accuracy": 0.9906128525733948,
+      "num_tokens": 46637478.0,
+      "step": 2270
+    },
+    {
+      "entropy": 0.9985016226768494,
+      "epoch": 4.198895027624309,
+      "grad_norm": 0.8301573991775513,
+      "learning_rate": 6.706920656757234e-05,
+      "loss": 0.02455987185239792,
+      "mean_token_accuracy": 0.9905728340148926,
+      "num_tokens": 46842562.0,
+      "step": 2280
+    },
+    {
+      "entropy": 0.9909430682659149,
+      "epoch": 4.21731123388582,
+      "grad_norm": 0.656026303768158,
+      "learning_rate": 6.67826511426001e-05,
+      "loss": 0.022711564600467683,
+      "mean_token_accuracy": 0.9910893619060517,
+      "num_tokens": 47048071.0,
+      "step": 2290
+    },
+    {
+      "entropy": 0.9868666052818298,
+      "epoch": 4.23572744014733,
+      "grad_norm": 0.7614991068840027,
+      "learning_rate": 6.649547313600916e-05,
+      "loss": 0.02453812211751938,
+      "mean_token_accuracy": 0.9908901154994965,
+      "num_tokens": 47253507.0,
+      "step": 2300
+    },
+    {
+      "entropy": 0.9870487153530121,
+      "epoch": 4.25414364640884,
+      "grad_norm": 0.7617276906967163,
+      "learning_rate": 6.62076832011669e-05,
+      "loss": 0.025818097591400146,
+      "mean_token_accuracy": 0.990347957611084,
+      "num_tokens": 47458747.0,
+      "step": 2310
+    },
+    {
+      "entropy": 0.9691080570220947,
+      "epoch": 4.27255985267035,
+      "grad_norm": 0.6743029952049255,
+      "learning_rate": 6.591929201414124e-05,
+      "loss": 0.02456912100315094,
+      "mean_token_accuracy": 0.9905289709568024,
+      "num_tokens": 47663643.0,
+      "step": 2320
+    },
+    {
+      "entropy": 0.9701108932495117,
+      "epoch": 4.29097605893186,
+      "grad_norm": 0.6964483261108398,
+      "learning_rate": 6.56303102733046e-05,
+      "loss": 0.02575681209564209,
+      "mean_token_accuracy": 0.9898503363132477,
+      "num_tokens": 47868982.0,
+      "step": 2330
+    },
+    {
+      "entropy": 0.969528192281723,
+      "epoch": 4.30939226519337,
+      "grad_norm": 0.7521987557411194,
+      "learning_rate": 6.5340748698937e-05,
+      "loss": 0.02678089737892151,
+      "mean_token_accuracy": 0.9898572087287902,
+      "num_tokens": 48074314.0,
+      "step": 2340
+    },
+    {
+      "entropy": 0.9921871721744537,
+      "epoch": 4.327808471454881,
+      "grad_norm": 0.6944513320922852,
+      "learning_rate": 6.505061803282844e-05,
+      "loss": 0.025553321838378905,
+      "mean_token_accuracy": 0.9907529592514038,
+      "num_tokens": 48279731.0,
+      "step": 2350
+    },
+    {
+      "entropy": 0.9768964886665344,
+      "epoch": 4.346224677716391,
+      "grad_norm": 0.6553092002868652,
+      "learning_rate": 6.47599290378803e-05,
+      "loss": 0.0250235915184021,
+      "mean_token_accuracy": 0.9904054701328278,
+      "num_tokens": 48485401.0,
+      "step": 2360
+    },
+    {
+      "entropy": 0.9612838506698609,
+      "epoch": 4.3646408839779,
+      "grad_norm": 0.916820228099823,
+      "learning_rate": 6.446869249770619e-05,
+      "loss": 0.028156182169914244,
+      "mean_token_accuracy": 0.9888657331466675,
+      "num_tokens": 48691047.0,
+      "step": 2370
+    },
+    {
+      "entropy": 0.9665832936763763,
+      "epoch": 4.383057090239411,
+      "grad_norm": 0.9197776913642883,
+      "learning_rate": 6.417691921623185e-05,
+      "loss": 0.025303921103477477,
+      "mean_token_accuracy": 0.989986252784729,
+      "num_tokens": 48896234.0,
+      "step": 2380
+    },
+    {
+      "entropy": 0.9686589121818543,
+      "epoch": 4.401473296500921,
+      "grad_norm": 0.8505764603614807,
+      "learning_rate": 6.388462001729434e-05,
+      "loss": 0.024816396832466125,
+      "mean_token_accuracy": 0.9909265041351318,
+      "num_tokens": 49101893.0,
+      "step": 2390
+    },
+    {
+      "entropy": 0.9625210344791413,
+      "epoch": 4.419889502762431,
+      "grad_norm": 1.0601766109466553,
+      "learning_rate": 6.359180574424062e-05,
+      "loss": 0.02706078290939331,
+      "mean_token_accuracy": 0.9895522117614746,
+      "num_tokens": 49307467.0,
+      "step": 2400
+    },
+    {
+      "entropy": 0.9679551541805267,
+      "epoch": 4.4383057090239415,
+      "grad_norm": 0.776253879070282,
+      "learning_rate": 6.329848725952514e-05,
+      "loss": 0.02693203091621399,
+      "mean_token_accuracy": 0.9893981635570526,
+      "num_tokens": 49513020.0,
+      "step": 2410
+    },
+    {
+      "entropy": 0.9704959928989411,
+      "epoch": 4.456721915285451,
+      "grad_norm": 0.5459668636322021,
+      "learning_rate": 6.3004675444307e-05,
+      "loss": 0.0279473751783371,
+      "mean_token_accuracy": 0.9894329369068146,
+      "num_tokens": 49718405.0,
+      "step": 2420
+    },
+    {
+      "entropy": 0.961863350868225,
+      "epoch": 4.475138121546961,
+      "grad_norm": 0.9338833093643188,
+      "learning_rate": 6.27103811980462e-05,
+      "loss": 0.026478803157806395,
+      "mean_token_accuracy": 0.9902269721031189,
+      "num_tokens": 49923375.0,
+      "step": 2430
+    },
+    {
+      "entropy": 0.9708506822586059,
+      "epoch": 4.4935543278084715,
+      "grad_norm": 0.9073707461357117,
+      "learning_rate": 6.241561543809947e-05,
+      "loss": 0.025289520621299744,
+      "mean_token_accuracy": 0.9904769957065582,
+      "num_tokens": 50128901.0,
+      "step": 2440
+    },
+    {
+      "entropy": 0.984996622800827,
+      "epoch": 4.511970534069982,
+      "grad_norm": 0.8674206733703613,
+      "learning_rate": 6.212038909931503e-05,
+      "loss": 0.026442551612854005,
+      "mean_token_accuracy": 0.9905101835727692,
+      "num_tokens": 50334449.0,
+      "step": 2450
+    },
+    {
+      "entropy": 0.9926377475261688,
+      "epoch": 4.530386740331492,
+      "grad_norm": 0.7571811079978943,
+      "learning_rate": 6.182471313362717e-05,
+      "loss": 0.026819539070129395,
+      "mean_token_accuracy": 0.9898989200592041,
+      "num_tokens": 50539597.0,
+      "step": 2460
+    },
+    {
+      "entropy": 0.9450563549995422,
+      "epoch": 4.5488029465930016,
+      "grad_norm": 0.6651087403297424,
+      "learning_rate": 6.15285985096498e-05,
+      "loss": 0.02665227949619293,
+      "mean_token_accuracy": 0.9897156655788422,
+      "num_tokens": 50744926.0,
+      "step": 2470
+    },
+    {
+      "entropy": 0.9715635657310486,
+      "epoch": 4.567219152854512,
+      "grad_norm": 0.7445545196533203,
+      "learning_rate": 6.12320562122697e-05,
+      "loss": 0.026212453842163086,
+      "mean_token_accuracy": 0.9904700636863708,
+      "num_tokens": 50950152.0,
+      "step": 2480
+    },
+    {
+      "entropy": 0.9613442063331604,
+      "epoch": 4.585635359116022,
+      "grad_norm": 0.7168459296226501,
+      "learning_rate": 6.0935097242238837e-05,
+      "loss": 0.02508128583431244,
+      "mean_token_accuracy": 0.9901923894882202,
+      "num_tokens": 51155430.0,
+      "step": 2490
+    },
+    {
+      "entropy": 0.9571944534778595,
+      "epoch": 4.6040515653775325,
+      "grad_norm": 0.7590732574462891,
+      "learning_rate": 6.063773261576646e-05,
+      "loss": 0.025445500016212465,
+      "mean_token_accuracy": 0.9902949810028077,
+      "num_tokens": 51360826.0,
+      "step": 2500
+    },
+    {
+      "entropy": 0.947079461812973,
+      "epoch": 4.622467771639043,
+      "grad_norm": 0.6942175030708313,
+      "learning_rate": 6.033997336411035e-05,
+      "loss": 0.026132801175117494,
+      "mean_token_accuracy": 0.9900939345359803,
+      "num_tokens": 51566095.0,
+      "step": 2510
+    },
+    {
+      "entropy": 0.970003741979599,
+      "epoch": 4.640883977900552,
+      "grad_norm": 0.6562672257423401,
+      "learning_rate": 6.00418305331675e-05,
+      "loss": 0.024759869277477264,
+      "mean_token_accuracy": 0.9905019223690033,
+      "num_tokens": 51771177.0,
+      "step": 2520
+    },
+    {
+      "entropy": 0.9715348601341247,
+      "epoch": 4.6593001841620625,
+      "grad_norm": 0.6151639819145203,
+      "learning_rate": 5.9743315183064564e-05,
+      "loss": 0.024138522148132325,
+      "mean_token_accuracy": 0.9910101473331452,
+      "num_tokens": 51976349.0,
+      "step": 2530
+    },
+    {
+      "entropy": 0.9552160143852234,
+      "epoch": 4.677716390423573,
+      "grad_norm": 0.968815267086029,
+      "learning_rate": 5.9444438387747336e-05,
+      "loss": 0.027274739742279053,
+      "mean_token_accuracy": 0.9896075248718261,
+      "num_tokens": 52181820.0,
+      "step": 2540
+    },
+    {
+      "entropy": 0.9265012145042419,
+      "epoch": 4.696132596685083,
+      "grad_norm": 0.8966720700263977,
+      "learning_rate": 5.914521123457015e-05,
+      "loss": 0.0291823148727417,
+      "mean_token_accuracy": 0.9886700630187988,
+      "num_tokens": 52387511.0,
+      "step": 2550
+    },
+    {
+      "entropy": 0.9156096875667572,
+      "epoch": 4.714548802946593,
+      "grad_norm": 0.7747519612312317,
+      "learning_rate": 5.88456448238844e-05,
+      "loss": 0.02809179127216339,
+      "mean_token_accuracy": 0.9891100466251374,
+      "num_tokens": 52592737.0,
+      "step": 2560
+    },
+    {
+      "entropy": 0.924511456489563,
+      "epoch": 4.732965009208103,
+      "grad_norm": 1.0087049007415771,
+      "learning_rate": 5.8545750268626844e-05,
+      "loss": 0.02683232128620148,
+      "mean_token_accuracy": 0.9896528899669648,
+      "num_tokens": 52798814.0,
+      "step": 2570
+    },
+    {
+      "entropy": 0.9662951111793519,
+      "epoch": 4.751381215469613,
+      "grad_norm": 0.7709590792655945,
+      "learning_rate": 5.824553869390734e-05,
+      "loss": 0.02503817081451416,
+      "mean_token_accuracy": 0.9900161385536194,
+      "num_tokens": 53004478.0,
+      "step": 2580
+    },
+    {
+      "entropy": 0.9889141619205475,
+      "epoch": 4.769797421731123,
+      "grad_norm": 0.815858006477356,
+      "learning_rate": 5.794502123659613e-05,
+      "loss": 0.026327347755432128,
+      "mean_token_accuracy": 0.9900785744190216,
+      "num_tokens": 53209888.0,
+      "step": 2590
+    },
+    {
+      "entropy": 0.9785685896873474,
+      "epoch": 4.788213627992634,
+      "grad_norm": 0.6514431238174438,
+      "learning_rate": 5.7644209044910735e-05,
+      "loss": 0.025033789873123168,
+      "mean_token_accuracy": 0.9902650475502014,
+      "num_tokens": 53415533.0,
+      "step": 2600
+    },
+    {
+      "entropy": 0.9723869919776916,
+      "epoch": 4.806629834254144,
+      "grad_norm": 0.8778963685035706,
+      "learning_rate": 5.7343113278002284e-05,
+      "loss": 0.02379843294620514,
+      "mean_token_accuracy": 0.9909472465515137,
+      "num_tokens": 53620850.0,
+      "step": 2610
+    },
+    {
+      "entropy": 0.9572711050510406,
+      "epoch": 4.8250460405156534,
+      "grad_norm": 0.8927134871482849,
+      "learning_rate": 5.70417451055417e-05,
+      "loss": 0.024856947362422943,
+      "mean_token_accuracy": 0.9904125213623047,
+      "num_tokens": 53826259.0,
+      "step": 2620
+    },
+    {
+      "entropy": 0.9523135125637054,
+      "epoch": 4.843462246777164,
+      "grad_norm": 0.6832691431045532,
+      "learning_rate": 5.674011570730523e-05,
+      "loss": 0.025352203845977785,
+      "mean_token_accuracy": 0.990432596206665,
+      "num_tokens": 54031531.0,
+      "step": 2630
+    },
+    {
+      "entropy": 0.9735220730304718,
+      "epoch": 4.861878453038674,
+      "grad_norm": 0.6399164795875549,
+      "learning_rate": 5.643823627275972e-05,
+      "loss": 0.026541513204574586,
+      "mean_token_accuracy": 0.9900369107723236,
+      "num_tokens": 54237155.0,
+      "step": 2640
+    },
+    {
+      "entropy": 0.9566517114639282,
+      "epoch": 4.880294659300184,
+      "grad_norm": 0.8725414276123047,
+      "learning_rate": 5.6136118000647616e-05,
+      "loss": 0.02675778865814209,
+      "mean_token_accuracy": 0.9894899427890778,
+      "num_tokens": 54442739.0,
+      "step": 2650
+    },
+    {
+      "entropy": 0.9447909593582153,
+      "epoch": 4.898710865561695,
+      "grad_norm": 0.8169302344322205,
+      "learning_rate": 5.583377209857138e-05,
+      "loss": 0.02642086148262024,
+      "mean_token_accuracy": 0.989885401725769,
+      "num_tokens": 54648098.0,
+      "step": 2660
+    },
+    {
+      "entropy": 0.9180052697658538,
+      "epoch": 4.917127071823204,
+      "grad_norm": 0.7768753170967102,
+      "learning_rate": 5.553120978257787e-05,
+      "loss": 0.02552323341369629,
+      "mean_token_accuracy": 0.9899512350559234,
+      "num_tokens": 54854281.0,
+      "step": 2670
+    },
+    {
+      "entropy": 0.917166668176651,
+      "epoch": 4.935543278084714,
+      "grad_norm": 0.8241410851478577,
+      "learning_rate": 5.5228442276742153e-05,
+      "loss": 0.02788199484348297,
+      "mean_token_accuracy": 0.989625746011734,
+      "num_tokens": 55059495.0,
+      "step": 2680
+    },
+    {
+      "entropy": 0.9345465302467346,
+      "epoch": 4.953959484346225,
+      "grad_norm": 0.7645496129989624,
+      "learning_rate": 5.4925480812751166e-05,
+      "loss": 0.02517639398574829,
+      "mean_token_accuracy": 0.9902283847332001,
+      "num_tokens": 55265381.0,
+      "step": 2690
+    },
+    {
+      "entropy": 0.9386432528495788,
+      "epoch": 4.972375690607735,
+      "grad_norm": 0.8371859192848206,
+      "learning_rate": 5.46223366294871e-05,
+      "loss": 0.025585666298866272,
+      "mean_token_accuracy": 0.9903791427612305,
+      "num_tokens": 55471210.0,
+      "step": 2700
+    },
+    {
+      "entropy": 0.9267561137676239,
+      "epoch": 4.990791896869245,
+      "grad_norm": 0.6789297461509705,
+      "learning_rate": 5.43190209726104e-05,
+      "loss": 0.024646708369255067,
+      "mean_token_accuracy": 0.9904700815677643,
+      "num_tokens": 55676877.0,
+      "step": 2710
+    },
+    {
+      "epoch": 5.0,
+      "eval_entropy": 0.9283919717954553,
+      "eval_loss": 0.06225527077913284,
+      "eval_mean_token_accuracy": 0.9784110421719758,
+      "eval_num_tokens": 55779559.0,
+      "eval_runtime": 10.0613,
+      "eval_samples_per_second": 363.573,
+      "eval_steps_per_second": 11.43,
+      "step": 2715
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 5430,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 10,
+  "save_steps": 500,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 2.6591019550449336e+18,
+  "train_batch_size": 32,
+  "trial_name": null,
+  "trial_params": null
+}
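
The log above can be read back programmatically. The following is a minimal sketch, not part of the commit, assuming the entries live under the `log_history` key that Transformers' `TrainerState` writes; the two entries and the `max_steps` / `num_train_epochs` values are copied from the file above, and the inline `state_text` string is a stand-in for reading `checkpoint-2715/trainer_state.json` from disk.

```python
import json

# Stand-in for the real file: two entries and two fields copied from the diff above.
state_text = """
{
  "log_history": [
    {"epoch": 4.990791896869245, "loss": 0.024646708369255067, "step": 2710},
    {"epoch": 5.0, "eval_loss": 0.06225527077913284, "step": 2715}
  ],
  "max_steps": 5430,
  "num_train_epochs": 10
}
"""

state = json.loads(state_text)

# Training entries carry "loss"; evaluation entries carry "eval_loss".
train_logs = [e for e in state["log_history"] if "loss" in e]
eval_logs = [e for e in state["log_history"] if "eval_loss" in e]

# Optimizer steps per epoch, derived from the totals in the file.
steps_per_epoch = state["max_steps"] // state["num_train_epochs"]

# Gap between the epoch-5 eval loss and the last logged training loss.
gap = eval_logs[-1]["eval_loss"] - train_logs[-1]["loss"]

print(f"steps per epoch: {steps_per_epoch}")
print(f"eval/train loss gap at step {eval_logs[-1]['step']}: {gap:.4f}")
```

With 543 steps per epoch, the checkpoint directory names in this commit (2715, 3258, 3801) fall on the ends of epochs 5, 6 and 7.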

checkpoint-2715/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
+size 5777
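
The three added lines are a Git LFS pointer, not the binary itself. As a rough illustration, the sketch below splits such a pointer into its fields; `parse_lfs_pointer` is a hypothetical helper written for this example, and the pointer text is copied from the hunk above.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
size 5777
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

Two pointers with the same `oid` refer to byte-identical objects, which is how one can tell that `training_args.bin` is unchanged between checkpoints in this commit.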

checkpoint-3258/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-3258/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
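
To get a feel for what this configuration implies, the sketch below estimates the number of trainable LoRA parameters and the effective scaling factor. `r`, `lora_alpha` and `target_modules` come from the file above; the layer dimensions are assumptions taken from the published Qwen2.5-7B-Instruct architecture and do not appear anywhere in this diff, and float32 storage is also an assumption.

```python
# Values from adapter_config.json above.
r = 8
lora_alpha = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# ASSUMPTION: Qwen2.5-7B-Instruct dimensions (not stated in this diff).
hidden = 3584          # model width
intermediate = 18944   # MLP width
kv_dim = 512           # 4 key/value heads of size 128
num_layers = 28

# (in_features, out_features) of each targeted linear layer.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank) per layer.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(*shapes[name], r) for name in target_modules)
total_params = per_layer * num_layers

# With use_rslora false, the update is scaled by lora_alpha / r.
scaling = lora_alpha / r

# ASSUMPTION: weights stored as float32 (4 bytes each).
approx_bytes = total_params * 4

print(total_params, scaling, approx_bytes)
```

Under these assumptions the estimate is 80,740,352 bytes, about 52 KB below the 80,792,096-byte `adapter_model.safetensors` pointer size in this commit; the remainder is plausibly file header metadata.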

checkpoint-3258/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:342d90add5af5dfae9087ea9a560f86a7ebd48116022794da4d371303756af39
+size 80792096

checkpoint-3258/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
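
For readers who want to see what the template produces, here is a simplified Python re-implementation of its no-tools branch. It is a sketch for illustration only: `render_chatml` is a hypothetical helper, it handles plain `system`, `user` and `assistant` messages, and it ignores the tool-call and tool-response branches entirely. In practice the tokenizer's `apply_chat_template` method renders the real Jinja file.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Render messages the way the template's no-tools branch does (simplified)."""
    parts = []
    if messages and messages[0]["role"] == "system":
        # A leading system message replaces the default system prompt.
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
        rest = messages[1:]
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
        rest = messages
    for message in rest:
        # Every remaining turn is wrapped in <|im_start|>role ... <|im_end|>.
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

Note that the default system prompt still names Qwen, so a persona adapter trained on this template sees that prompt unless the caller supplies its own system message.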

checkpoint-3258/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

checkpoint-3258/tokenizer_config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

checkpoint-3258/trainer_state.json ADDED

The diff for this file is too large to render.

checkpoint-3258/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21325c9bdff5ed34f0cc34837ee67ed216c9301ab4d9b2e26f048b563564bd75
+size 5777

checkpoint-3801/README.md ADDED

	@@ -0,0 +1,209 @@

+---
+base_model: Qwen/Qwen2.5-7B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:Qwen/Qwen2.5-7B-Instruct
+- lora
+- sft
+- transformers
+- trl
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1

checkpoint-3801/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "v_proj",
+    "down_proj",
+    "q_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}

checkpoint-3801/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ddf199d0ea570dc0a07cf04650028719a3b079854d9cd9c9c87302cd1e916916
+size 80792096

checkpoint-3801/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}

checkpoint-3801/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892