Instructions to use jacob-ml/Jacob-2-E4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

PEFT

How to use jacob-ml/Jacob-2-E4B with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")
model = PeftModel.from_pretrained(base_model, "jacob-ml/Jacob-2-E4B")

Transformers

How to use jacob-ml/Jacob-2-E4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below include an image, so use the multimodal pipeline task
pipe = pipeline("image-text-to-text", model="jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("jacob-ml/Jacob-2-E4B")
model = AutoModelForMultimodalLM.from_pretrained("jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use jacob-ml/Jacob-2-E4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jacob-ml/Jacob-2-E4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
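
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000` and returns the usual OpenAI-style chat completion body; the helper names `build_chat_request` and `ask` are hypothetical, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="jacob-ml/Jacob-2-E4B",
                       base_url="http://localhost:8000"):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response layout: choices -> message -> content.
    return body["choices"][0]["message"]["content"]


# Example call, only meaningful while the server is up:
# print(ask("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead.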

Use Docker

docker model run hf.co/jacob-ml/Jacob-2-E4B

SGLang

How to use jacob-ml/Jacob-2-E4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jacob-ml/Jacob-2-E4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jacob-ml/Jacob-2-E4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use jacob-ml/Jacob-2-E4B with Docker Model Runner:
```
docker model run hf.co/jacob-ml/Jacob-2-E4B
```

Jacob-2-E4B

File size: 409,716 Bytes

[2026-06-14 14:08:51,250] [DEBUG] [axolotl.utils.config.resolve_dtype:74] [PID:3393] bf16 support detected, enabling for this configuration.
[2026-06-14 14:08:51,433] [WARNING] [axolotl.utils.config.normalize_config:281] [PID:3393] Gemma4 requires use_reentrant=False for gradient checkpointing in distributed training. Setting use_reentrant=False.
[2026-06-14 14:08:51,433] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:3393] baseline 0.000GB ()
[2026-06-14 14:08:51,434] [INFO] [axolotl.cli.config.load_cfg:333] [PID:3393] config:
{
  "activation_offloading": true,
  "adapter": "lora",
  "attn_implementation": "sdpa",
  "attn_needs_dtype_cast": false,
  "attn_supports_packing": false,
  "attn_uses_flash_lib": false,
  "axolotl_config_path": "./config.yaml",
  "base_model": "google/gemma-4-E4B-it",
  "base_model_config": "google/gemma-4-E4B-it",
  "batch_size": 16,
  "bf16": true,
  "capabilities": {
    "bf16": true,
    "compute_capability": "sm_80",
    "fp8": false,
    "n_gpu": 1,
    "n_node": 1,
    "tf32": true
  },
  "chat_template": "jinja",
  "chat_template_jinja": "./jinja",
  "context_parallel_size": 1,
  "cut_cross_entropy": true,
  "dataloader_num_workers": 1,
  "dataloader_pin_memory": true,
  "dataloader_prefetch_factor": 256,
  "dataset_num_proc": 31,
  "dataset_prepared_path": "./dataset-e4b",
  "datasets": [
    {
      "chat_template": "tokenizer_default",
      "field_messages": "messages",
      "field_tools": "tools",
      "message_property_mappings": {
        "content": "content",
        "role": "role"
      },
      "path": "jacob-ml/Jacob-2-SSFT-filtered",
      "split": "train",
      "trust_remote_code": false,
      "type": "chat_template"
    }
  ],
  "ddp": false,
  "device": "cuda:0",
  "dion_rank_fraction": 1.0,
  "dion_rank_multiple_of": 1,
  "eaft_alpha": 1.0,
  "eaft_k": 20,
  "env_capabilities": {
    "torch_version": "2.10.0"
  },
  "eval_batch_size": 2,
  "eval_causal_lm_metrics": [
    "sacrebleu",
    "comet",
    "ter",
    "chrf"
  ],
  "eval_max_new_tokens": 128,
  "eval_table_size": 0,
  "experimental_skip_move_to_device": true,
  "fp16": false,
  "freeze_mm_modules": true,
  "generate_samples": false,
  "generation_do_sample": true,
  "generation_max_new_tokens": 50,
  "generation_prompt_ratio": 0.5,
  "generation_temperature": 0.7,
  "gradient_accumulation_steps": 8,
  "gradient_checkpointing": true,
  "gradient_checkpointing_kwargs": {
    "use_reentrant": false
  },
  "hub_model_id": "jacob-ml/Jacob-2-E4B",
  "include_tkps": true,
  "is_multimodal": true,
  "layer_offloading": true,
  "learning_rate": 0.0002,
  "lisa_layers_attribute": "model.layers",
  "load_best_model_at_end": false,
  "load_in_4bit": false,
  "load_in_8bit": true,
  "local_rank": 0,
  "logging_steps": 1,
  "lora_alpha": 16,
  "lora_dropout": 0.0,
  "lora_r": 16,
  "lora_target_modules": "model.language_model.layers.[\\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj",
  "loraplus_lr_embedding": 1e-06,
  "lr_scheduler": "cosine",
  "mean_resizing_embeddings": false,
  "merge_method": "memory_efficient",
  "micro_batch_size": 2,
  "model_config_type": "gemma4",
  "model_config_type_text": "gemma4_text",
  "num_epochs": 1.0,
  "num_generation_samples": 3,
  "optimizer": "adamw_torch_8bit",
  "otel_metrics_host": "localhost",
  "otel_metrics_port": 8000,
  "output_dir": "./outputs/Jacob-2-E4B",
  "pad_to_sequence_len": false,
  "plugins": [
    "axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin"
  ],
  "pretrain_multipack_attn": true,
  "processor_config": "google/gemma-4-E4B-it",
  "profiler_steps_start": 0,
  "qgalore_cos_threshold": 0.4,
  "qgalore_gamma_proj": 2,
  "qgalore_proj_bits": 4,
  "qgalore_proj_group_size": 256,
  "qgalore_proj_quant": true,
  "qgalore_proj_type": "std",
  "qgalore_queue_size": 5,
  "qgalore_rank": 256,
  "qgalore_scale": 0.25,
  "qgalore_update_proj_gap": 200,
  "qlora_sharded_model_loading": false,
  "quantize_moe_experts": false,
  "ray_num_workers": 1,
  "relora_prune_method": "magnitude",
  "resources_per_worker": {
    "GPU": 1
  },
  "sample_packing": false,
  "sample_packing_bin_size": 200,
  "sample_packing_group_size": 100000,
  "save_only_model": false,
  "save_safetensors": true,
  "sequence_len": 8192,
  "shuffle_before_merging_datasets": false,
  "shuffle_merged_datasets": true,
  "skip_prepare_dataset": false,
  "streaming_multipack_buffer_size": 10000,
  "strict": false,
  "tensor_parallel_size": 1,
  "tf32": false,
  "tiled_mlp_use_original_mlp": true,
  "tokenizer_config": "google/gemma-4-E4B-it",
  "tokenizer_save_jinja_files": true,
  "torch_dtype": "torch.bfloat16",
  "train_on_inputs": false,
  "trl": {
    "async_prefetch": false,
    "log_completions": false,
    "mask_truncated_completions": false,
    "ref_model_mixup_alpha": 0.9,
    "ref_model_sync_steps": 64,
    "replay_buffer_size": 0,
    "replay_recompute_logps": true,
    "reroll_max_groups": 1,
    "reroll_start_fraction": 1.0,
    "reward_num_workers": 1,
    "scale_rewards": true,
    "skip_zero_advantage_batches": true,
    "sync_ref_model": false,
    "use_data_producer": false,
    "use_vllm": false,
    "vllm_lora_sync": false,
    "vllm_server_host": "0.0.0.0",
    "vllm_server_port": 8000
  },
  "use_otel_metrics": false,
  "use_ray": false,
  "val_set_size": 0.0,
  "vllm": {
    "device": "auto",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
    "host": "0.0.0.0",
    "port": 8000
  },
  "warmup_ratio": 0.1,
  "weight_decay": 0.0,
  "world_size": 1
}
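
The `lora_target_modules` value in the config above is a single regular expression rather than a list of module names. As far as I understand PEFT, a string value is matched as a regex against each full module name. The sketch below checks the pattern with `re.fullmatch` against a few module names; those names are hypothetical examples written in the style the pattern suggests, not names read from the actual model.

```python
import re

# Pattern copied from "lora_target_modules" in the logged config
# (the JSON escape "\\d" becomes "\d" once decoded).
LORA_TARGET_PATTERN = (
    r"model.language_model.layers.[\d]+."
    r"(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj"
)


def is_lora_target(module_name: str) -> bool:
    """Return True when the whole module name matches the pattern."""
    return re.fullmatch(LORA_TARGET_PATTERN, module_name) is not None


# Hypothetical module names, for illustration only.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.12.mlp.down_proj",
    "model.language_model.layers.3._checkpoint_wrapped_module.mlp.gate_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_norm",
]

for name in candidates:
    print(f"{name}: {is_lora_target(name)}")
```

Under these assumptions the pattern selects only the attention and MLP projection layers of the language model, with or without the gradient-checkpointing wrapper in the path. Note that the unescaped `.` characters match any character, so the pattern is slightly more permissive than a literal dotted path.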
[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos>
[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos>
[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad>
[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk>
[2026-06-14 14:08:55,227] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:482] [PID:3393] Unable to find prepared dataset in dataset-e4b/226f5539ba5a2355ba6a34bd68b2a326
[2026-06-14 14:08:55,228] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:3393] Loading raw datasets...
[2026-06-14 14:08:55,228] [WARNING] [axolotl.utils.data.sft._load_raw_datasets:322] [PID:3393] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset using `axolotl preprocess path/to/config.yml`.

Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Fetching 0 files: 0it [00:00, ?it/s]
Download complete: : 0.00B [00:00, ?B/s]
[2026-06-14 14:08:56,312] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:3393] Loading dataset: jacob-ml/Jacob-2-SSFT-filtered with base_type: chat_template and prompt_style: None
[2026-06-14 14:08:56,315] [INFO] [axolotl.prompt_strategies.chat_template.__call__:1209] [PID:3393] Using chat template:
---
{%- macro format_parameters(properties, required, filter_keys=false) -%}
    {%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%}
    {%- set ns = namespace(found_first=false) -%}
    {%- for key, value in properties | dictsort -%}
        {%- set add_comma = false -%}
        {%- if not filter_keys or key not in standard_keys -%}
            {%- if ns.found_first %},{% endif -%}
            {%- set ns.found_first = true -%}
            {{ key }}:{
            {%- if value['description'] -%}
                description:<|"|>{{ value['description'] }}<|"|>
                {%- set add_comma = true -%}
            {%- endif -%}
            {%- if value['type'] | upper == 'STRING' -%}
                {%- if value['enum'] -%}
                    {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                    enum:{{ format_argument(value['enum']) }}
                {%- endif -%}
            {%- elif value['type'] | upper == 'ARRAY' -%}
                {%- if value['items'] is mapping and value['items'] -%}
                    {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                    items:{
                    {%- set ns_items = namespace(found_first=false) -%}
                    {%- for item_key, item_value in value['items'] | dictsort -%}
                        {%- if item_value is not none -%}
                            {%- if ns_items.found_first %},{% endif -%}
                            {%- set ns_items.found_first = true -%}
                            {%- if item_key == 'properties' -%}
                                properties:{
                                {%- if item_value is mapping -%}
                                    {{- format_parameters(item_value, value['items']['required'] | default([])) -}}
                                {%- endif -%}
                                }
                            {%- elif item_key == 'required' -%}
                                required:[
                                {%- for req_item in item_value -%}
                                    <|"|>{{- req_item -}}<|"|>
                                    {%- if not loop.last %},{% endif -%}
                                {%- endfor -%}
                                ]
                            {%- elif item_key == 'type' -%}
                                {%- if item_value is string -%}
                                    type:{{ format_argument(item_value | upper) }}
                                {%- else -%}
                                    type:{{ format_argument(item_value | map('upper') | list) }}
                                {%- endif -%}
                            {%- else -%}
                                {{ item_key }}:{{ format_argument(item_value) }}
                            {%- endif -%}
                        {%- endif -%}
                    {%- endfor -%}
                    }
                {%- endif -%}
            {%- endif -%}
            {%- if value['nullable'] %}
                {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                nullable:true
            {%- endif -%}
            {%- if value['type'] | upper == 'OBJECT' -%}
                {%- if value['properties'] is defined and value['properties'] is mapping -%}
                    {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                    properties:{
                    {{- format_parameters(value['properties'], value['required'] | default([])) -}}
                    }
                {%- elif value is mapping -%}
                    {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                    properties:{
                    {{- format_parameters(value, value['required'] | default([]), filter_keys=true) -}}
                    }
                {%- endif -%}
                {%- if value['required'] -%}
                    {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
                    required:[
                    {%- for item in value['required'] | default([]) -%}
                        <|"|>{{- item -}}<|"|>
                        {%- if not loop.last %},{% endif -%}
                    {%- endfor -%}
                    ]
                {%- endif -%}
            {%- endif -%}
            {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
            type:<|"|>{{ value['type'] | upper }}<|"|>}
        {%- endif -%}
    {%- endfor -%}
{%- endmacro -%}
{%- macro format_function_declaration(tool_data) -%}
    declaration:{{- tool_data['function']['name'] -}}{description:<|"|>{{- tool_data['function']['description'] -}}<|"|>
    {%- set params = tool_data['function']['parameters'] -%}
    {%- if params -%}
        ,parameters:{
        {%- if params['properties'] -%}
            properties:{ {{- format_parameters(params['properties'], params['required']) -}} },
        {%- endif -%}
        {%- if params['required'] -%}
            required:[
            {%- for item in params['required'] -%}
                <|"|>{{- item -}}<|"|>
                {{- ',' if not loop.last -}}
            {%- endfor -%}
            ],
        {%- endif -%}
        {%- if params['type'] -%}
            type:<|"|>{{- params['type'] | upper -}}<|"|>}
        {%- endif -%}
    {%- endif -%}
    {%- if 'response' in tool_data['function'] -%}
        {%- set response_declaration = tool_data['function']['response'] -%}
        ,response:{
        {%- if response_declaration['description'] -%}
            description:<|"|>{{- response_declaration['description'] -}}<|"|>,
        {%- endif -%}
        {%- if response_declaration['type'] | upper == 'OBJECT' -%}
            type:<|"|>{{- response_declaration['type'] | upper -}}<|"|>}
        {%- endif -%}
    {%- endif -%}
    }
{%- endmacro -%}
{%- macro format_argument(argument, escape_keys=True) -%}
    {%- if argument is string -%}
        {{- '<|"|>' + argument + '<|"|>' -}}
    {%- elif argument is boolean -%}
        {{- 'true' if argument else 'false' -}}
    {%- elif argument is mapping -%}
        {{- '{' -}}
        {%- set ns = namespace(found_first=false) -%}
        {%- for key, value in argument | dictsort -%}
            {%- if ns.found_first %},{% endif -%}
            {%- set ns.found_first = true -%}
            {%- if escape_keys -%}
                {{- '<|"|>' + key + '<|"|>' -}}
            {%- else -%}
                {{- key -}}
            {%- endif -%}
            :{{- format_argument(value, escape_keys=escape_keys) -}}
        {%- endfor -%}
        {{- '}' -}}
    {%- elif argument is sequence -%}
        {{- '[' -}}
        {%- for item in argument -%}
            {{- format_argument(item, escape_keys=escape_keys) -}}
            {%- if not loop.last %},{% endif -%}
        {%- endfor -%}
        {{- ']' -}}
    {%- else -%}
        {{- argument -}}
    {%- endif -%}
{%- endmacro -%}
{%- macro strip_thinking(text) -%}
    {%- set ns = namespace(result='') -%}
    {%- for part in text.split('<channel|>') -%}
        {%- if '<|channel>' in part -%}
            {%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
        {%- else -%}
            {%- set ns.result = ns.result + part -%}
        {%- endif -%}
    {%- endfor -%}
    {{- ns.result | trim -}}
{%- endmacro -%}

{%- macro format_tool_response_block(tool_name, response) -%}
    {{- '<|tool_response>' -}}
    {%- if response is mapping -%}
        {{- 'response:' + tool_name + '{' -}}
        {%- for key, value in response | dictsort -%}
            {{- key -}}:{{- format_argument(value, escape_keys=False) -}}
            {%- if not loop.last %},{% endif -%}
        {%- endfor -%}
        {{- '}' -}}
    {%- else -%}
        {{- 'response:' + tool_name + '{value:' + format_argument(response, escape_keys=False) + '}' -}}
    {%- endif -%}
    {{- '<tool_response|>' -}}
{%- endmacro -%}

{%- set ns = namespace(prev_message_type=None) -%}
{%- set loop_messages = messages -%}
{{- bos_token -}}
{#- Handle System/Tool Definitions Block -#}
{%- if (enable_thinking is defined and enable_thinking) or tools or messages[0]['role'] in ['system', 'developer'] -%}
    {{- '<|turn>system\n' -}}
    {#- Inject Thinking token at the very top of the FIRST system turn -#}
    {%- if enable_thinking is defined and enable_thinking -%}
        {{- '<|think|>\n' -}}
        {%- set ns.prev_message_type = 'think' -%}
    {%- endif -%}
    {%- if messages[0]['role'] in ['system', 'developer'] -%}
        {%- if messages[0]['content'] is string -%}
            {{- messages[0]['content'] | trim -}}
        {%- elif messages[0]['content'] is sequence -%}
            {%- for item in messages[0]['content'] -%}
                {{- item['text'] | trim + ' '-}}
            {%- endfor -%}
        {%- endif -%}
        {%- set loop_messages = messages[1:] -%}
    {%- endif -%}
    {%- if tools -%}
        {%- for tool in tools %}
            {{- '<|tool>' -}}
            {{- format_function_declaration(tool) | trim -}}
            {{- '<tool|>' -}}
        {%- endfor %}
        {%- set ns.prev_message_type = 'tool' -%}
    {%- endif -%}
    {{- '<turn|>\n' -}}
{%- endif %}

{#- Pre-scan: find last user message index for reasoning guard -#}
{%- set ns_turn = namespace(last_user_idx=-1) -%}
{%- for i in range(loop_messages | length) -%}
    {%- if loop_messages[i]['role'] == 'user' -%}
        {%- set ns_turn.last_user_idx = i -%}
    {%- endif -%}
{%- endfor -%}

{#- Loop through messages -#}
{%- for message in loop_messages -%}
    {%- if message['role'] != 'tool' -%}
    {%- set ns.prev_message_type = None -%}
    {%- set role = 'model' if message['role'] == 'assistant' else message['role'] -%}
    {%- if message['role'] == 'tool' and 'name' in message -%}
        {%- set _tool_name = message['name'] -%}
    {%- endif -%}
    {#- Detect continuation: suppress duplicate <|turn>model when previous non-tool message was also assistant -#}
    {%- set prev_nt = namespace(role=None, found=false) -%}
    {%- if loop.index0 > 0 -%}
        {%- for j in range(loop.index0 - 1, -1, -1) -%}
            {%- if not prev_nt.found -%}
                {%- if loop_messages[j]['role'] != 'tool' -%}
                    {%- set prev_nt.role = loop_messages[j]['role'] -%}
                    {%- set prev_nt.found = true -%}
                {%- endif -%}
            {%- endif -%}
        {%- endfor -%}
    {%- endif -%}
    {%- set continue_same_model_turn = (role == 'model' and prev_nt.role == 'assistant') -%}
    {%- if not continue_same_model_turn -%}
        {{- '<|turn>' + role + '\n' }}
    {%- endif -%}

    {#- Render reasoning/reasoning_content as thinking channel -#}
    {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
    {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
        {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
    {%- endif -%}

            {%- if message['tool_calls'] -%}
                {%- for tool_call in message['tool_calls'] -%}
                    {%- set function = tool_call['function'] -%}
                    {{- '<|tool_call>call:' + function['name'] + '{' -}}
                    {%- if function['arguments'] is mapping -%}
                        {%- set ns_args = namespace(found_first=false) -%}
                        {%- for key, value in function['arguments'] | dictsort -%}
                            {%- if ns_args.found_first %},{% endif -%}
                            {%- set ns_args.found_first = true -%}
                            {{- key -}}:{{- format_argument(value, escape_keys=False) -}}
                        {%- endfor -%}
                    {%- elif function['arguments'] is string -%}
                        {{- function['arguments'] -}}
                    {%- endif -%}
                    {{- '}<tool_call|>' -}}
                {%- endfor -%}
                {%- set ns.prev_message_type = 'tool_call' -%}
            {%- endif -%}

            {%- set ns_tr_out = namespace(flag=false) -%}
            {%- if message.get('tool_responses') -%}
                {#- Legacy: tool_responses embedded on the assistant message (Google/Gemma native) -#}
                {%- for tool_response in message['tool_responses'] -%}
                    {{- format_tool_response_block(tool_response['name'] | default('unknown', true), tool_response['response']) -}}
                    {%- set ns_tr_out.flag = true -%}
                    {%- set ns.prev_message_type = 'tool_response' -%}
                {%- endfor -%}
            {%- elif message.get('tool_calls') -%}
                {#- OpenAI Chat Completions: forward-scan consecutive role:tool messages -#}
                {%- set ns_tool_scan = namespace(stopped=false) -%}
                {%- for k in range(loop.index0 + 1, loop_messages | length) -%}
                    {%- if ns_tool_scan.stopped -%}
                    {%- elif loop_messages[k]['role'] != 'tool' -%}
                        {%- set ns_tool_scan.stopped = true -%}
                    {%- else -%}
                        {%- set follow = loop_messages[k] -%}
                        {#- Resolve tool_call_id to function name -#}
                        {%- set ns_tname = namespace(name=follow['name'] | default('unknown', true)) -%}
                        {%- for tc in message['tool_calls'] -%}
                            {%- if tc.get('id') == follow.get('tool_call_id') -%}
                                {%- set ns_tname.name = tc['function']['name'] -%}
                            {%- endif -%}
                        {%- endfor -%}
                        {#- Handle content as string or content-parts array -#}
                        {%- set tool_body = follow.get('content') -%}
                        {%- if tool_body is string -%}
                            {{- format_tool_response_block(ns_tname.name, tool_body) -}}
                        {%- elif tool_body is sequence and tool_body is not string -%}
                            {%- set ns_txt = namespace(s='') -%}
                            {%- for part in tool_body -%}
                                {%- if part.get('type') == 'text' -%}
                                    {%- set ns_txt.s = ns_txt.s + (part.get('text') | default('')) -%}
                                {%- endif -%}
                            {%- endfor -%}
                            {{- format_tool_response_block(ns_tname.name, ns_txt.s) -}}
                            {%- for part in tool_body -%}
                                {%- if part.get('type') == 'image' -%}
                                    {{- '<|image|>' -}}
                                {%- elif part.get('type') == 'audio' -%}
                                    {{- '<|audio|>' -}}
                                {%- elif part.get('type') == 'video' -%}
                                    {{- '<|video|>' -}}
                                {%- endif -%}
                            {%- endfor -%}
                        {%- else -%}
                            {{- format_tool_response_block(ns_tname.name, tool_body) -}}
                        {%- endif -%}
                        {%- set ns_tr_out.flag = true -%}
                        {%- set ns.prev_message_type = 'tool_response' -%}
                    {%- endif -%}
                {%- endfor -%}
            {%- endif -%}

            {%- set captured_content -%}
            {%- if message['content'] is string -%}
                {%- if role == 'model' -%}
                    {{- strip_thinking(message['content']) -}}
                {%- else -%}
                    {{- message['content'] | trim -}}
                {%- endif -%}
            {%- elif message['content'] is sequence -%}
                {%- for item in message['content'] -%}
                    {%- if item['type'] == 'text' -%}
                        {%- if role == 'model' -%}
                            {{- strip_thinking(item['text']) -}}
                        {%- else -%}
                            {{- item['text'] | trim -}}
                        {%- endif -%}
                    {%- elif item['type'] == 'image' -%}
                        {{- '<|image|>' -}}
                        {%- set ns.prev_message_type = 'image' -%}
                    {%- elif item['type'] == 'audio' -%}
                        {{- '<|audio|>' -}}
                        {%- set ns.prev_message_type = 'audio' -%}
                    {%- elif item['type'] == 'video' -%}
                        {{- '<|video|>' -}}
                        {%- set ns.prev_message_type = 'video' -%}
                    {%- endif -%}
                {%- endfor -%}
            {%- endif -%}
            {%- endset -%}

            {{- captured_content -}}
            {%- set has_content = captured_content | trim | length > 0 -%}

        {%- if ns.prev_message_type == 'tool_call' and not ns_tr_out.flag -%}
            {{- '<|tool_response>' -}}
        {%- elif not (ns_tr_out.flag and not has_content) -%}
            {{- '<turn|>\n' -}}
        {%- endif -%}
    {%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
    {%- if ns.prev_message_type != 'tool_response' and ns.prev_message_type != 'tool_call' -%}
        {{- '<|turn>model\n' -}}
    {%- endif -%}
{%- endif -%}

---
[2026-06-14 14:08:56,434] [WARNING] [axolotl.prompt_strategies.chat_template._validate_eot_and_eos_tokens:357] [PID:3393] EOS token '<eos>' not found in chat_template. Please check if your template/EOS token is correct.
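
For plain text conversations, the turn layout produced by the template logged above can be approximated in a few lines of Python. This is a simplified sketch, not the template itself: it assumes string contents, roles limited to `system`, `user`, and `assistant`, no tools, no `enable_thinking`, and no back-to-back assistant messages. The function names are hypothetical; `<bos>` is taken from the tokenizer lines earlier in the log.

```python
BOS = "<bos>"  # BOS token as reported in the tokenizer log lines


def strip_thinking(text: str) -> str:
    """Port of the template's strip_thinking macro: drop thought-channel text."""
    result = ""
    for part in text.split("<channel|>"):
        if "<|channel>" in part:
            # Keep only what precedes the opening channel marker.
            result += part.split("<|channel>")[0]
        else:
            result += part
    return result.strip()


def render_text_chat(messages, add_generation_prompt=True) -> str:
    """Render text-only messages into the <|turn>role ... <turn|> layout."""
    out = [BOS]
    for message in messages:
        # The template renames the assistant role to "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"]
        body = strip_thinking(content) if role == "model" else content.strip()
        out.append(f"<|turn>{role}\n{body}<turn|>\n")
    if add_generation_prompt:
        out.append("<|turn>model\n")
    return "".join(out)


conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "<|channel>thought\nEasy one.<channel|>Paris."},
    {"role": "user", "content": "And of Italy?"},
]
print(render_text_chat(conversation))
```

In this layout each turn is closed with `<turn|>` and `<eos>` never appears, which is consistent with the warning above about the EOS token not being found in the chat template.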

Tokenizing Prompts (num_proc=31):   0%|                                                                                                                       | 0/4209 [00:00<?, ? examples/s]
Tokenizing Prompts (num_proc=31):  94%|█████████████████████████████████████████████████████████████████████████████████████████████████████       | 3939/4209 [02:09<00:08, 33.46 examples/s]
Tokenizing Prompts (num_proc=31):  97%|████████████████████████████████████████████████████████████████████████████████████████████████████████▌   | 4074/4209 [02:13<00:04, 33.27 examples/s]
Tokenizing Prompts (num_proc=31): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4209/4209 [02:18<00:00, 30.78 examples/s]
Tokenizing Prompts (num_proc=31): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4209/4209 [02:19<00:00, 30.16 examples/s]
[2026-06-14 14:11:56,198] [INFO] [axolotl.utils.data.utils._log_dataset_stats:212] [PID:3393] min_input_len: 320
[2026-06-14 14:11:56,199] [INFO] [axolotl.utils.data.utils._log_dataset_stats:213] [PID:3393] max_input_len: 23372

Dropping Invalid Sequences (<None or >8192) (num_proc=31): 100%|█████████████████████████████████████████████████████████████████████████████████| 4209/4209 [00:02<00:00, 1595.10 examples/s]
[2026-06-14 14:11:58,919] [INFO] [axolotl.utils.data.utils._drop_outside_range:306] [PID:3393] Dropped 15 sequences outside valid range ([None, 8192])

Saving the dataset (16/16 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4194/4194 [00:15<00:00, 270.70 examples/s]
[2026-06-14 14:12:15,030] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:420] [PID:3393] total_num_tokens: 6_012_949
[2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:438] [PID:3393] `total_supervised_tokens: 3_764_690`
[2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:521] [PID:3393] total_num_steps: 263
[2026-06-14 14:12:15,143] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:121] [PID:3393] Maximum number of steps set at 263
[2026-06-14 14:12:15,405] [DEBUG] [axolotl.train.setup_model_and_tokenizer:70] [PID:3393] loading tokenizer... google/gemma-4-E4B-it
[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos>
[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos>
[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad>
[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk>
[2026-06-14 14:12:24,886] [DEBUG] [axolotl.train.setup_model_and_tokenizer:81] [PID:3393] Loading model
[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:75] [PID:3393] Patched OptimState8bit for torch.compile compatibility
[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:122] [PID:3393] Patched OptimState4bit for torch.compile compatibility
[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:154] [PID:3393] Patched OptimStateFp8 for torch.compile compatibility
[2026-06-14 14:12:24,936] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:94] [PID:3393] Patched Trainer.evaluation_loop with nanmean loss calculation
[2026-06-14 14:12:24,937] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:148] [PID:3393] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
[2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:207] [PID:3393] Patched Gemma4TextAttention.forward with fused RMSNorm+RoPE Triton kernels
[2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:211] [PID:3393] Installed Gemma4 shared_kv_states side channel (PR #3611)
[2026-06-14 14:12:25,062] [INFO] [axolotl.integrations.cut_cross_entropy.pre_model_load:94] [PID:3393] Applying Cut Cross Entropy to model type: gemma4

Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2076/2076 [00:25<00:00, 81.97it/s]
[2026-06-14 14:12:52,761] [INFO] [axolotl.loaders.model._prepare_model_for_quantization:977] [PID:3393] converting PEFT model w/ prepare_model_for_kbit_training
[2026-06-14 14:12:52,780] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:433] [PID:3393] Converting modules to torch.bfloat16
[2026-06-14 14:12:52,965] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] Memory usage after model load 22.419GB (+22.419GB allocated, +28.939GB reserved)
trainable params: 34,881,536 || all params: 7,975,982,368 || trainable%: 0.4373
[2026-06-14 14:12:53,476] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] after adapters 10.827GB (+10.827GB allocated, +29.068GB reserved)
[2026-06-14 14:12:55,145] [INFO] [axolotl.utils.freeze.freeze_mm_modules:49] [PID:3393] freeze_mm_modules: froze 0 vision/audio parameters
[2026-06-14 14:12:56,196] [INFO] [axolotl.core.trainers.mixins.layer_offloading.__init__:291] [PID:3393] Layer parameter offloading enabled
[2026-06-14 14:12:56,197] [WARNING] [axolotl.core.trainers.mixins.layer_offloading.__init__:73] [PID:3393] LayerOffloadManager: no decoder layers found, offloading disabled
[2026-06-14 14:12:56,197] [INFO] [axolotl.train.save_initial_configs:450] [PID:3393] Pre-saving adapter config to ./outputs/Jacob-2-E4B...
[2026-06-14 14:12:56,198] [INFO] [axolotl.train.save_initial_configs:454] [PID:3393] Pre-saving tokenizer to ./outputs/Jacob-2-E4B...
[2026-06-14 14:12:56,696] [INFO] [axolotl.train.save_initial_configs:459] [PID:3393] Pre-saving model config to ./outputs/Jacob-2-E4B...
[2026-06-14 14:12:56,700] [INFO] [axolotl.train.save_initial_configs:463] [PID:3393] Pre-saving processor to ./outputs/Jacob-2-E4B...
[2026-06-14 14:12:57,113] [INFO] [axolotl.train.execute_training:226] [PID:3393] Starting trainer...

  0%|                                                                                                                                                                 | 0/263 [00:00<?, ?it/s]
[2026-06-14 14:13:00,854] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)

[2026-06-14 14:14:09,129] [INFO] [axolotl.kernels.autotune_telemetry.on_step_end:133] [PID:3393] Reported 2 fused-rope kernel autotune config(s) to telemetry.

  0%|▌                                                                                                                                                      | 1/263 [01:11<5:11:11, 71.27s/it]
                                                                                                                                                                                              
{'loss': '1.576', 'grad_norm': '0.5319', 'learning_rate': '0', 'ppl': '4.836', 'memory/max_active (GiB)': '35.37', 'memory/max_allocated (GiB)': '35.37', 'memory/device_reserved (GiB)': '44.53', 'tokens/trainable': 13646, 'tokens/total': 34320, 'epoch': '0.003815'}

  1%|█▏                                                                                                                                                     | 2/263 [02:07<4:31:48, 62.49s/it]
                                                                                                                                                                                              
{'loss': '1.541', 'grad_norm': '0.4915', 'learning_rate': '7.692e-06', 'ppl': '4.671', 'memory/max_active (GiB)': '37.15', 'memory/max_allocated (GiB)': '37.15', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '272.8', 'tokens/trainable': 29007, 'tokens/total': 65680, 'epoch': '0.00763'}

  1%|█▋                                                                                                                                                     | 3/263 [02:52<3:55:45, 54.40s/it]
                                                                                                                                                                                              
{'loss': '1.676', 'grad_norm': '0.5334', 'learning_rate': '1.538e-05', 'ppl': '5.345', 'memory/max_active (GiB)': '26.41', 'memory/max_allocated (GiB)': '26.41', 'memory/device_reserved (GiB)': '34.21', 'tokens/train_per_sec_per_gpu': '241.3', 'tokens/trainable': 39814, 'tokens/total': 89258, 'epoch': '0.01144'}

  2%|██▎                                                                                                                                                    | 4/263 [03:26<3:20:53, 46.54s/it]
                                                                                                                                                                                              
{'loss': '1.701', 'grad_norm': '0.6123', 'learning_rate': '2.308e-05', 'ppl': '5.478', 'memory/max_active (GiB)': '22.96', 'memory/max_allocated (GiB)': '22.96', 'memory/device_reserved (GiB)': '25.77', 'tokens/train_per_sec_per_gpu': '256.6', 'tokens/trainable': 48660, 'tokens/total': 108024, 'epoch': '0.01526'}

  2%|██▊                                                                                                                                                    | 5/263 [04:08<3:12:35, 44.79s/it]
                                                                                                                                                                                              
{'loss': '1.721', 'grad_norm': '0.6678', 'learning_rate': '3.077e-05', 'ppl': '5.59', 'memory/max_active (GiB)': '24.42', 'memory/max_allocated (GiB)': '24.42', 'memory/device_reserved (GiB)': '31.26', 'tokens/train_per_sec_per_gpu': '200.1', 'tokens/trainable': 57004, 'tokens/total': 127360, 'epoch': '0.01907'}

  2%|███▍                                                                                                                                                   | 6/263 [05:00<3:22:41, 47.32s/it]
                                                                                                                                                                                              
{'loss': '1.441', 'grad_norm': '0.5136', 'learning_rate': '3.846e-05', 'ppl': '4.225', 'memory/max_active (GiB)': '43.28', 'memory/max_allocated (GiB)': '43.28', 'memory/device_reserved (GiB)': '60.49', 'tokens/train_per_sec_per_gpu': '340.5', 'tokens/trainable': 74786, 'tokens/total': 165128, 'epoch': '0.02289'}

  3%|████                                                                                                                                                   | 7/263 [05:43<3:15:55, 45.92s/it]
                                                                                                                                                                                              
{'loss': '1.527', 'grad_norm': '0.5568', 'learning_rate': '4.615e-05', 'ppl': '4.603', 'memory/max_active (GiB)': '38.81', 'memory/max_allocated (GiB)': '38.81', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '290.1', 'tokens/trainable': 87272, 'tokens/total': 194790, 'epoch': '0.0267'}

  3%|████▌                                                                                                                                                  | 8/263 [06:32<3:19:30, 46.94s/it]
                                                                                                                                                                                              
{'loss': '1.629', 'grad_norm': '0.7715', 'learning_rate': '5.385e-05', 'ppl': '5.098', 'memory/max_active (GiB)': '34.9', 'memory/max_allocated (GiB)': '34.9', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '247.6', 'tokens/trainable': 99438, 'tokens/total': 223594, 'epoch': '0.03052'}

  3%|█████▏                                                                                                                                                 | 9/263 [07:20<3:19:53, 47.22s/it]
                                                                                                                                                                                              
{'loss': '1.545', 'grad_norm': '0.7755', 'learning_rate': '6.154e-05', 'ppl': '4.688', 'memory/max_active (GiB)': '42.92', 'memory/max_allocated (GiB)': '42.92', 'memory/device_reserved (GiB)': '59.79', 'tokens/train_per_sec_per_gpu': '298.1', 'tokens/trainable': 113695, 'tokens/total': 256322, 'epoch': '0.03433'}

  4%|█████▋                                                                                                                                                | 10/263 [08:10<3:21:47, 47.86s/it]
                                                                                                                                                                                              
{'loss': '1.348', 'grad_norm': '0.6723', 'learning_rate': '6.923e-05', 'ppl': '3.848', 'memory/max_active (GiB)': '28.92', 'memory/max_allocated (GiB)': '28.92', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '238.3', 'tokens/trainable': 125442, 'tokens/total': 282552, 'epoch': '0.03815'}

  4%|██████▎                                                                                                                                               | 11/263 [08:50<3:10:58, 45.47s/it]
                                                                                                                                                                                              
{'loss': '1.337', 'grad_norm': '0.701', 'learning_rate': '7.692e-05', 'ppl': '3.808', 'memory/max_active (GiB)': '25.92', 'memory/max_allocated (GiB)': '25.92', 'memory/device_reserved (GiB)': '33.79', 'tokens/train_per_sec_per_gpu': '275.6', 'tokens/trainable': 136483, 'tokens/total': 306800, 'epoch': '0.04196'}

  5%|██████▊                                                                                                                                               | 12/263 [09:22<2:53:32, 41.48s/it]
                                                                                                                                                                                              
{'loss': '1.188', 'grad_norm': '0.5657', 'learning_rate': '8.462e-05', 'ppl': '3.28', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.74', 'tokens/train_per_sec_per_gpu': '364.8', 'tokens/trainable': 148289, 'tokens/total': 329988, 'epoch': '0.04578'}

  5%|███████▍                                                                                                                                              | 13/263 [10:18<3:11:41, 46.01s/it]
                                                                                                                                                                                              
{'loss': '1.115', 'grad_norm': '0.5181', 'learning_rate': '9.231e-05', 'ppl': '3.048', 'memory/max_active (GiB)': '48.15', 'memory/max_allocated (GiB)': '48.15', 'memory/device_reserved (GiB)': '67.89', 'tokens/train_per_sec_per_gpu': '312.6', 'tokens/trainable': 165922, 'tokens/total': 365046, 'epoch': '0.04959'}

  5%|███████▉                                                                                                                                              | 14/263 [11:08<3:15:29, 47.11s/it]
                                                                                                                                                                                              
{'loss': '1.097', 'grad_norm': '0.5891', 'learning_rate': '0.0001', 'ppl': '2.995', 'memory/max_active (GiB)': '37.46', 'memory/max_allocated (GiB)': '37.46', 'memory/device_reserved (GiB)': '47.39', 'tokens/train_per_sec_per_gpu': '291.3', 'tokens/trainable': 180382, 'tokens/total': 398168, 'epoch': '0.05341'}

  6%|████████▌                                                                                                                                             | 15/263 [11:59<3:19:12, 48.19s/it]
                                                                                                                                                                                              
{'loss': '1.005', 'grad_norm': '0.5228', 'learning_rate': '0.0001077', 'ppl': '2.732', 'memory/max_active (GiB)': '39.68', 'memory/max_allocated (GiB)': '39.68', 'memory/device_reserved (GiB)': '55.06', 'tokens/train_per_sec_per_gpu': '245.6', 'tokens/trainable': 192840, 'tokens/total': 425746, 'epoch': '0.05722'}

  6%|█████████▏                                                                                                                                            | 16/263 [12:52<3:24:35, 49.70s/it]
                                                                                                                                                                                              
{'loss': '0.871', 'grad_norm': '0.36', 'learning_rate': '0.0001154', 'ppl': '2.389', 'memory/max_active (GiB)': '50.71', 'memory/max_allocated (GiB)': '50.71', 'memory/device_reserved (GiB)': '71.7', 'tokens/train_per_sec_per_gpu': '322.6', 'tokens/trainable': 209997, 'tokens/total': 464790, 'epoch': '0.06104'}

  6%|█████████▋                                                                                                                                            | 17/263 [13:45<3:27:34, 50.63s/it]
                                                                                                                                                                                              
{'loss': '0.98', 'grad_norm': '0.5478', 'learning_rate': '0.0001231', 'ppl': '2.664', 'memory/max_active (GiB)': '59.65', 'memory/max_allocated (GiB)': '59.65', 'memory/device_reserved (GiB)': '77.29', 'tokens/train_per_sec_per_gpu': '331.7', 'tokens/trainable': 227508, 'tokens/total': 502040, 'epoch': '0.06485'}

  7%|██████████▎                                                                                                                                           | 18/263 [14:33<3:24:00, 49.96s/it]
                                                                                                                                                                                              
{'loss': '1.008', 'grad_norm': '0.3831', 'learning_rate': '0.0001308', 'ppl': '2.74', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '52.24', 'tokens/train_per_sec_per_gpu': '315.6', 'tokens/trainable': 242788, 'tokens/total': 533846, 'epoch': '0.06867'}

  7%|██████████▊                                                                                                                                           | 19/263 [15:13<3:10:30, 46.85s/it]
                                                                                                                                                                                              
{'loss': '0.9226', 'grad_norm': '0.4333', 'learning_rate': '0.0001385', 'ppl': '2.516', 'memory/max_active (GiB)': '41.82', 'memory/max_allocated (GiB)': '41.82', 'memory/device_reserved (GiB)': '58.38', 'tokens/train_per_sec_per_gpu': '382.9', 'tokens/trainable': 257947, 'tokens/total': 566334, 'epoch': '0.07248'}

  8%|███████████▍                                                                                                                                          | 20/263 [16:11<3:23:38, 50.28s/it]
                                                                                                                                                                                              
{'loss': '0.9231', 'grad_norm': '0.4565', 'learning_rate': '0.0001462', 'ppl': '2.517', 'memory/max_active (GiB)': '37.11', 'memory/max_allocated (GiB)': '37.11', 'memory/device_reserved (GiB)': '46.94', 'tokens/train_per_sec_per_gpu': '361.3', 'tokens/trainable': 279009, 'tokens/total': 610288, 'epoch': '0.0763'}

  8%|███████████▉                                                                                                                                          | 21/263 [17:02<3:23:17, 50.40s/it]
                                                                                                                                                                                              
{'loss': '0.9047', 'grad_norm': '0.5018', 'learning_rate': '0.0001538', 'ppl': '2.471', 'memory/max_active (GiB)': '33.83', 'memory/max_allocated (GiB)': '33.83', 'memory/device_reserved (GiB)': '42.33', 'tokens/train_per_sec_per_gpu': '243', 'tokens/trainable': 291324, 'tokens/total': 637560, 'epoch': '0.08011'}

  8%|████████████▌                                                                                                                                         | 22/263 [17:44<3:12:51, 48.01s/it]
                                                                                                                                                                                              
{'loss': '0.892', 'grad_norm': '0.4374', 'learning_rate': '0.0001615', 'ppl': '2.44', 'memory/max_active (GiB)': '31.34', 'memory/max_allocated (GiB)': '31.34', 'memory/device_reserved (GiB)': '41.63', 'tokens/train_per_sec_per_gpu': '440', 'tokens/trainable': 309998, 'tokens/total': 672066, 'epoch': '0.08393'}

  9%|█████████████                                                                                                                                         | 23/263 [18:16<2:52:03, 43.01s/it]
                                                                                                                                                                                              
{'loss': '1', 'grad_norm': '0.3918', 'learning_rate': '0.0001692', 'ppl': '2.719', 'memory/max_active (GiB)': '27.48', 'memory/max_allocated (GiB)': '27.48', 'memory/device_reserved (GiB)': '35.84', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 322055, 'tokens/total': 698356, 'epoch': '0.08774'}

  9%|█████████████▋                                                                                                                                        | 24/263 [18:42<2:31:15, 37.97s/it]
                                                                                                                                                                                              
{'loss': '0.9319', 'grad_norm': '0.3207', 'learning_rate': '0.0001769', 'ppl': '2.539', 'memory/max_active (GiB)': '22.66', 'memory/max_allocated (GiB)': '22.66', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '382.2', 'tokens/trainable': 332075, 'tokens/total': 718552, 'epoch': '0.09156'}

 10%|██████████████▎                                                                                                                                       | 25/263 [19:26<2:38:17, 39.90s/it]
                                                                                                                                                                                              
{'loss': '0.8685', 'grad_norm': '0.2337', 'learning_rate': '0.0001846', 'ppl': '2.383', 'memory/max_active (GiB)': '31.11', 'memory/max_allocated (GiB)': '31.11', 'memory/device_reserved (GiB)': '38.38', 'tokens/train_per_sec_per_gpu': '287', 'tokens/trainable': 344822, 'tokens/total': 746628, 'epoch': '0.09537'}

 10%|██████████████▊                                                                                                                                       | 26/263 [20:17<2:50:36, 43.19s/it]
                                                                                                                                                                                              
{'loss': '0.9079', 'grad_norm': '0.2105', 'learning_rate': '0.0001923', 'ppl': '2.479', 'memory/max_active (GiB)': '27.1', 'memory/max_allocated (GiB)': '27.1', 'memory/device_reserved (GiB)': '35.33', 'tokens/train_per_sec_per_gpu': '241.9', 'tokens/trainable': 357129, 'tokens/total': 775900, 'epoch': '0.09919'}

 10%|███████████████▍                                                                                                                                      | 27/263 [20:59<2:48:53, 42.94s/it]
                                                                                                                                                                                              
{'loss': '0.9097', 'grad_norm': '0.2115', 'learning_rate': '0.0002', 'ppl': '2.484', 'memory/max_active (GiB)': '45.8', 'memory/max_allocated (GiB)': '45.8', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '297.6', 'tokens/trainable': 369730, 'tokens/total': 802730, 'epoch': '0.103'}

 11%|███████████████▉                                                                                                                                      | 28/263 [21:41<2:46:11, 42.43s/it]
                                                                                                                                                                                              
{'loss': '0.8107', 'grad_norm': '0.1699', 'learning_rate': '0.0002', 'ppl': '2.249', 'memory/max_active (GiB)': '39.66', 'memory/max_allocated (GiB)': '39.66', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '399.8', 'tokens/trainable': 386221, 'tokens/total': 839556, 'epoch': '0.1068'}

 11%|████████████████▌                                                                                                                                     | 29/263 [22:13<2:33:22, 39.33s/it]
                                                                                                                                                                                              
{'loss': '0.8618', 'grad_norm': '0.188', 'learning_rate': '0.0002', 'ppl': '2.367', 'memory/max_active (GiB)': '30.9', 'memory/max_allocated (GiB)': '30.9', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '358.7', 'tokens/trainable': 397730, 'tokens/total': 866216, 'epoch': '0.1106'}

 11%|█████████████████                                                                                                                                     | 30/263 [22:51<2:31:11, 38.93s/it]
                                                                                                                                                                                              
{'loss': '0.7588', 'grad_norm': '0.1495', 'learning_rate': '0.0001999', 'ppl': '2.136', 'memory/max_active (GiB)': '31.81', 'memory/max_allocated (GiB)': '31.81', 'memory/device_reserved (GiB)': '42.5', 'tokens/train_per_sec_per_gpu': '420', 'tokens/trainable': 413695, 'tokens/total': 898918, 'epoch': '0.1144'}

 12%|█████████████████▋                                                                                                                                    | 31/263 [23:27<2:27:18, 38.10s/it]
                                                                                                                                                                                              
{'loss': '0.7096', 'grad_norm': '0.1585', 'learning_rate': '0.0001999', 'ppl': '2.033', 'memory/max_active (GiB)': '26.02', 'memory/max_allocated (GiB)': '26.02', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '283.3', 'tokens/trainable': 423936, 'tokens/total': 923272, 'epoch': '0.1183'}

 12%|██████████████████▎                                                                                                                                   | 32/263 [24:16<2:39:47, 41.50s/it]
                                                                                                                                                                                              
{'loss': '0.7129', 'grad_norm': '0.1396', 'learning_rate': '0.0001998', 'ppl': '2.04', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.17', 'tokens/train_per_sec_per_gpu': '434.5', 'tokens/trainable': 445423, 'tokens/total': 968072, 'epoch': '0.1221'}

 13%|██████████████████▊                                                                                                                                   | 33/263 [24:52<2:32:16, 39.72s/it]
                                                                                                                                                                                              
{'loss': '0.8433', 'grad_norm': '0.1598', 'learning_rate': '0.0001997', 'ppl': '2.324', 'memory/max_active (GiB)': '28.78', 'memory/max_allocated (GiB)': '28.78', 'memory/device_reserved (GiB)': '61.66', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 458707, 'tokens/total': 996758, 'epoch': '0.1259'}

 13%|███████████████████▍                                                                                                                                  | 34/263 [25:44<2:46:13, 43.55s/it]
                                                                                                                                                                                              
{'loss': '0.7034', 'grad_norm': '0.1374', 'learning_rate': '0.0001996', 'ppl': '2.021', 'memory/max_active (GiB)': '47.51', 'memory/max_allocated (GiB)': '47.51', 'memory/device_reserved (GiB)': '67.08', 'tokens/train_per_sec_per_gpu': '380.5', 'tokens/trainable': 478680, 'tokens/total': 1041764, 'epoch': '0.1297'}

 13%|███████████████████▉                                                                                                                                  | 35/263 [26:23<2:39:55, 42.09s/it]
                                                                                                                                                                                              
{'loss': '0.705', 'grad_norm': '0.1833', 'learning_rate': '0.0001994', 'ppl': '2.024', 'memory/max_active (GiB)': '35.78', 'memory/max_allocated (GiB)': '35.78', 'memory/device_reserved (GiB)': '45.08', 'tokens/train_per_sec_per_gpu': '333.5', 'tokens/trainable': 491578, 'tokens/total': 1071866, 'epoch': '0.1335'}

 14%|████████████████████▌                                                                                                                                 | 36/263 [27:25<3:01:14, 47.90s/it]
                                                                                                                                                                                              
{'loss': '0.7842', 'grad_norm': '0.129', 'learning_rate': '0.0001993', 'ppl': '2.191', 'memory/max_active (GiB)': '60.96', 'memory/max_allocated (GiB)': '60.96', 'memory/device_reserved (GiB)': '68.99', 'tokens/train_per_sec_per_gpu': '468.1', 'tokens/trainable': 520355, 'tokens/total': 1124412, 'epoch': '0.1373'}

 14%|█████████████████████                                                                                                                                 | 37/263 [27:58<2:43:52, 43.51s/it]
                                                                                                                                                                                              
{'loss': '0.81', 'grad_norm': '0.164', 'learning_rate': '0.0001991', 'ppl': '2.248', 'memory/max_active (GiB)': '40.05', 'memory/max_allocated (GiB)': '40.05', 'memory/device_reserved (GiB)': '55.65', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 533138, 'tokens/total': 1152862, 'epoch': '0.1412'}

 14%|█████████████████████▋                                                                                                                                | 38/263 [28:42<2:43:52, 43.70s/it]
                                                                                                                                                                                              
{'loss': '0.6781', 'grad_norm': '0.1564', 'learning_rate': '0.0001989', 'ppl': '1.97', 'memory/max_active (GiB)': '49.18', 'memory/max_allocated (GiB)': '49.18', 'memory/device_reserved (GiB)': '69.5', 'tokens/train_per_sec_per_gpu': '413.4', 'tokens/trainable': 551391, 'tokens/total': 1190852, 'epoch': '0.145'}

 15%|██████████████████████▏                                                                                                                               | 39/263 [29:17<2:33:07, 41.01s/it]
                                                                                                                                                                                              
{'loss': '0.8666', 'grad_norm': '0.1817', 'learning_rate': '0.0001987', 'ppl': '2.379', 'memory/max_active (GiB)': '42.54', 'memory/max_allocated (GiB)': '42.54', 'memory/device_reserved (GiB)': '59.24', 'tokens/train_per_sec_per_gpu': '349.9', 'tokens/trainable': 563547, 'tokens/total': 1219812, 'epoch': '0.1488'}

 15%|██████████████████████▊                                                                                                                               | 40/263 [29:53<2:27:38, 39.72s/it]
                                                                                                                                                                                              
{'loss': '0.7456', 'grad_norm': '0.1504', 'learning_rate': '0.0001985', 'ppl': '2.108', 'memory/max_active (GiB)': '37.56', 'memory/max_allocated (GiB)': '37.56', 'memory/device_reserved (GiB)': '47.6', 'tokens/train_per_sec_per_gpu': '429.8', 'tokens/trainable': 579326, 'tokens/total': 1251662, 'epoch': '0.1526'}

 16%|███████████████████████▍                                                                                                                              | 41/263 [30:30<2:23:42, 38.84s/it]
                                                                                                                                                                                              
{'loss': '0.7436', 'grad_norm': '0.1403', 'learning_rate': '0.0001983', 'ppl': '2.103', 'memory/max_active (GiB)': '35.11', 'memory/max_allocated (GiB)': '35.11', 'memory/device_reserved (GiB)': '44.3', 'tokens/train_per_sec_per_gpu': '410.9', 'tokens/trainable': 594442, 'tokens/total': 1285326, 'epoch': '0.1564'}

 16%|███████████████████████▉                                                                                                                              | 42/263 [31:12<2:26:40, 39.82s/it]
                                                                                                                                                                                              
{'loss': '0.752', 'grad_norm': '0.1654', 'learning_rate': '0.000198', 'ppl': '2.121', 'memory/max_active (GiB)': '31.65', 'memory/max_allocated (GiB)': '31.65', 'memory/device_reserved (GiB)': '41.45', 'tokens/train_per_sec_per_gpu': '317.2', 'tokens/trainable': 607800, 'tokens/total': 1315504, 'epoch': '0.1602'}

 16%|████████████████████████▌                                                                                                                             | 43/263 [31:54<2:28:01, 40.37s/it]
                                                                                                                                                                                              
{'loss': '0.823', 'grad_norm': '0.1637', 'learning_rate': '0.0001978', 'ppl': '2.277', 'memory/max_active (GiB)': '26.15', 'memory/max_allocated (GiB)': '26.15', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '277.6', 'tokens/trainable': 619363, 'tokens/total': 1338730, 'epoch': '0.164'}

 17%|█████████████████████████                                                                                                                             | 44/263 [32:34<2:26:37, 40.17s/it]
                                                                                                                                                                                              
{'loss': '0.7822', 'grad_norm': '0.1594', 'learning_rate': '0.0001975', 'ppl': '2.186', 'memory/max_active (GiB)': '36.38', 'memory/max_allocated (GiB)': '36.38', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '375.6', 'tokens/trainable': 634277, 'tokens/total': 1370806, 'epoch': '0.1679'}

 17%|█████████████████████████▋                                                                                                                            | 45/263 [33:11<2:23:02, 39.37s/it]
                                                                                                                                                                                              
{'loss': '0.7665', 'grad_norm': '0.1315', 'learning_rate': '0.0001972', 'ppl': '2.152', 'memory/max_active (GiB)': '32.6', 'memory/max_allocated (GiB)': '32.6', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '406.5', 'tokens/trainable': 649519, 'tokens/total': 1403632, 'epoch': '0.1717'}

 17%|██████████████████████████▏                                                                                                                           | 46/263 [33:41<2:12:01, 36.50s/it]
                                                                                                                                                                                              
{'loss': '0.7592', 'grad_norm': '0.1396', 'learning_rate': '0.0001968', 'ppl': '2.137', 'memory/max_active (GiB)': '22.71', 'memory/max_allocated (GiB)': '22.71', 'memory/device_reserved (GiB)': '33.91', 'tokens/train_per_sec_per_gpu': '378.9', 'tokens/trainable': 660815, 'tokens/total': 1426008, 'epoch': '0.1755'}

 18%|██████████████████████████▊                                                                                                                           | 47/263 [34:20<2:14:37, 37.40s/it]
                                                                                                                                                                                              
{'loss': '0.7492', 'grad_norm': '0.1523', 'learning_rate': '0.0001965', 'ppl': '2.115', 'memory/max_active (GiB)': '30.88', 'memory/max_allocated (GiB)': '30.88', 'memory/device_reserved (GiB)': '40.8', 'tokens/train_per_sec_per_gpu': '300.6', 'tokens/trainable': 672685, 'tokens/total': 1452446, 'epoch': '0.1793'}

 18%|███████████████████████████▍                                                                                                                          | 48/263 [35:11<2:28:02, 41.32s/it]
                                                                                                                                                                                              
{'loss': '0.7758', 'grad_norm': '0.1287', 'learning_rate': '0.0001962', 'ppl': '2.172', 'memory/max_active (GiB)': '46.66', 'memory/max_allocated (GiB)': '46.66', 'memory/device_reserved (GiB)': '65.45', 'tokens/train_per_sec_per_gpu': '309.9', 'tokens/trainable': 688320, 'tokens/total': 1484130, 'epoch': '0.1831'}

 19%|███████████████████████████▉                                                                                                                          | 49/263 [35:50<2:24:35, 40.54s/it]
                                                                                                                                                                                              
{'loss': '0.723', 'grad_norm': '0.1277', 'learning_rate': '0.0001958', 'ppl': '2.061', 'memory/max_active (GiB)': '31.37', 'memory/max_allocated (GiB)': '31.37', 'memory/device_reserved (GiB)': '41.67', 'tokens/train_per_sec_per_gpu': '351.3', 'tokens/trainable': 701926, 'tokens/total': 1512564, 'epoch': '0.1869'}

 19%|████████████████████████████▌                                                                                                                         | 50/263 [36:34<2:27:53, 41.66s/it]
                                                                                                                                                                                              
{'loss': '0.6734', 'grad_norm': '0.1139', 'learning_rate': '0.0001954', 'ppl': '1.961', 'memory/max_active (GiB)': '38.79', 'memory/max_allocated (GiB)': '38.79', 'memory/device_reserved (GiB)': '53.5', 'tokens/train_per_sec_per_gpu': '458.3', 'tokens/trainable': 722216, 'tokens/total': 1551038, 'epoch': '0.1907'}

 19%|█████████████████████████████                                                                                                                         | 51/263 [37:28<2:40:22, 45.39s/it]
                                                                                                                                                                                              
{'loss': '0.7031', 'grad_norm': '0.1693', 'learning_rate': '0.000195', 'ppl': '2.02', 'memory/max_active (GiB)': '45.1', 'memory/max_allocated (GiB)': '45.1', 'memory/device_reserved (GiB)': '63.62', 'tokens/train_per_sec_per_gpu': '270.4', 'tokens/trainable': 736843, 'tokens/total': 1585540, 'epoch': '0.1946'}

 20%|█████████████████████████████▋                                                                                                                        | 52/263 [38:10<2:36:32, 44.52s/it]
                                                                                                                                                                                              
{'loss': '0.7569', 'grad_norm': '0.1728', 'learning_rate': '0.0001946', 'ppl': '2.132', 'memory/max_active (GiB)': '20.75', 'memory/max_allocated (GiB)': '20.75', 'memory/device_reserved (GiB)': '25.67', 'tokens/train_per_sec_per_gpu': '179.7', 'tokens/trainable': 744474, 'tokens/total': 1602602, 'epoch': '0.1984'}

 20%|██████████████████████████████▏                                                                                                                       | 53/263 [38:53<2:33:22, 43.82s/it]
                                                                                                                                                                                              
{'loss': '0.81', 'grad_norm': '0.1633', 'learning_rate': '0.0001941', 'ppl': '2.248', 'memory/max_active (GiB)': '27.01', 'memory/max_allocated (GiB)': '27.01', 'memory/device_reserved (GiB)': '35.24', 'tokens/train_per_sec_per_gpu': '223.7', 'tokens/trainable': 753915, 'tokens/total': 1625426, 'epoch': '0.2022'}

 21%|██████████████████████████████▊                                                                                                                       | 54/263 [39:42<2:37:58, 45.35s/it]
                                                                                                                                                                                              
{'loss': '0.7126', 'grad_norm': '0.1385', 'learning_rate': '0.0001937', 'ppl': '2.039', 'memory/max_active (GiB)': '37.95', 'memory/max_allocated (GiB)': '37.95', 'memory/device_reserved (GiB)': '48.03', 'tokens/train_per_sec_per_gpu': '320.1', 'tokens/trainable': 769572, 'tokens/total': 1660362, 'epoch': '0.206'}

 21%|███████████████████████████████▎                                                                                                                      | 55/263 [40:31<2:41:57, 46.72s/it]
                                                                                                                                                                                              
{'loss': '0.6959', 'grad_norm': '0.1312', 'learning_rate': '0.0001932', 'ppl': '2.006', 'memory/max_active (GiB)': '45.85', 'memory/max_allocated (GiB)': '45.85', 'memory/device_reserved (GiB)': '64.3', 'tokens/train_per_sec_per_gpu': '344.5', 'tokens/trainable': 786763, 'tokens/total': 1697704, 'epoch': '0.2098'}

 21%|███████████████████████████████▉                                                                                                                      | 56/263 [41:15<2:37:46, 45.73s/it]
                                                                                                                                                                                              
{'loss': '0.7052', 'grad_norm': '0.1675', 'learning_rate': '0.0001927', 'ppl': '2.024', 'memory/max_active (GiB)': '28.46', 'memory/max_allocated (GiB)': '28.46', 'memory/device_reserved (GiB)': '37.35', 'tokens/train_per_sec_per_gpu': '235.5', 'tokens/trainable': 796991, 'tokens/total': 1720524, 'epoch': '0.2136'}

 22%|████████████████████████████████▌                                                                                                                     | 57/263 [42:15<2:51:52, 50.06s/it]
                                                                                                                                                                                              
{'loss': '0.6921', 'grad_norm': '0.1401', 'learning_rate': '0.0001922', 'ppl': '1.998', 'memory/max_active (GiB)': '54.96', 'memory/max_allocated (GiB)': '54.96', 'memory/device_reserved (GiB)': '78.17', 'tokens/train_per_sec_per_gpu': '296.9', 'tokens/trainable': 814853, 'tokens/total': 1756640, 'epoch': '0.2175'}

 22%|█████████████████████████████████                                                                                                                     | 58/263 [43:00<2:45:34, 48.46s/it]
                                                                                                                                                                                              
{'loss': '0.7537', 'grad_norm': '0.1416', 'learning_rate': '0.0001917', 'ppl': '2.125', 'memory/max_active (GiB)': '32.83', 'memory/max_allocated (GiB)': '32.83', 'memory/device_reserved (GiB)': '41.12', 'tokens/train_per_sec_per_gpu': '332.1', 'tokens/trainable': 829705, 'tokens/total': 1793570, 'epoch': '0.2213'}

 22%|█████████████████████████████████▋                                                                                                                    | 59/263 [43:42<2:38:03, 46.49s/it]
                                                                                                                                                                                              
{'loss': '0.731', 'grad_norm': '0.1516', 'learning_rate': '0.0001911', 'ppl': '2.077', 'memory/max_active (GiB)': '33.92', 'memory/max_allocated (GiB)': '33.92', 'memory/device_reserved (GiB)': '43.25', 'tokens/train_per_sec_per_gpu': '357.5', 'tokens/trainable': 844681, 'tokens/total': 1827586, 'epoch': '0.2251'}

 23%|██████████████████████████████████▏                                                                                                                   | 60/263 [44:31<2:40:35, 47.46s/it]
                                                                                                                                                                                              
{'loss': '0.7322', 'grad_norm': '0.1597', 'learning_rate': '0.0001906', 'ppl': '2.08', 'memory/max_active (GiB)': '27.46', 'memory/max_allocated (GiB)': '27.46', 'memory/device_reserved (GiB)': '42.7', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 855924, 'tokens/total': 1853648, 'epoch': '0.2289'}

 23%|██████████████████████████████████▊                                                                                                                   | 61/263 [45:19<2:39:37, 47.41s/it]
                                                                                                                                                                                              
{'loss': '0.7511', 'grad_norm': '0.1819', 'learning_rate': '0.00019', 'ppl': '2.119', 'memory/max_active (GiB)': '27.97', 'memory/max_allocated (GiB)': '27.97', 'memory/device_reserved (GiB)': '36.56', 'tokens/train_per_sec_per_gpu': '225.6', 'tokens/trainable': 866595, 'tokens/total': 1878090, 'epoch': '0.2327'}

 24%|███████████████████████████████████▎                                                                                                                  | 62/263 [46:01<2:33:28, 45.81s/it]
                                                                                                                                                                                              
{'loss': '0.6627', 'grad_norm': '0.1462', 'learning_rate': '0.0001894', 'ppl': '1.94', 'memory/max_active (GiB)': '38.92', 'memory/max_allocated (GiB)': '38.92', 'memory/device_reserved (GiB)': '53.8', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 882309, 'tokens/total': 1913740, 'epoch': '0.2365'}

 24%|███████████████████████████████████▉                                                                                                                  | 63/263 [47:16<3:01:53, 54.57s/it]
                                                                                                                                                                                              
{'loss': '0.6414', 'grad_norm': '0.1468', 'learning_rate': '0.0001888', 'ppl': '1.899', 'memory/max_active (GiB)': '56.13', 'memory/max_allocated (GiB)': '56.13', 'memory/device_reserved (GiB)': '73.46', 'tokens/train_per_sec_per_gpu': '332.8', 'tokens/trainable': 907269, 'tokens/total': 1971906, 'epoch': '0.2403'}

 24%|████████████████████████████████████▌                                                                                                                 | 64/263 [47:56<2:46:47, 50.29s/it]
                                                                                                                                                                                              
{'loss': '0.7153', 'grad_norm': '0.2024', 'learning_rate': '0.0001882', 'ppl': '2.045', 'memory/max_active (GiB)': '53.82', 'memory/max_allocated (GiB)': '53.82', 'memory/device_reserved (GiB)': '76.49', 'tokens/train_per_sec_per_gpu': '322', 'tokens/trainable': 920250, 'tokens/total': 2002390, 'epoch': '0.2442'}

 25%|█████████████████████████████████████                                                                                                                 | 65/263 [48:30<2:29:57, 45.44s/it]
                                                                                                                                                                                              
{'loss': '0.7389', 'grad_norm': '0.1698', 'learning_rate': '0.0001876', 'ppl': '2.094', 'memory/max_active (GiB)': '29.14', 'memory/max_allocated (GiB)': '29.14', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '324.5', 'tokens/trainable': 931325, 'tokens/total': 2028270, 'epoch': '0.248'}

 25%|█████████████████████████████████████▋                                                                                                                | 66/263 [49:17<2:30:15, 45.76s/it]
                                                                                                                                                                                              
{'loss': '0.8371', 'grad_norm': '0.2279', 'learning_rate': '0.0001869', 'ppl': '2.31', 'memory/max_active (GiB)': '33.75', 'memory/max_allocated (GiB)': '33.75', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '214.5', 'tokens/trainable': 941301, 'tokens/total': 2051108, 'epoch': '0.2518'}

 25%|██████████████████████████████████████▏                                                                                                               | 67/263 [50:05<2:31:52, 46.49s/it]
                                                                                                                                                                                              
{'loss': '0.705', 'grad_norm': '0.1503', 'learning_rate': '0.0001863', 'ppl': '2.024', 'memory/max_active (GiB)': '27.18', 'memory/max_allocated (GiB)': '27.18', 'memory/device_reserved (GiB)': '35.22', 'tokens/train_per_sec_per_gpu': '279', 'tokens/trainable': 954748, 'tokens/total': 2079644, 'epoch': '0.2556'}

 26%|██████████████████████████████████████▊                                                                                                               | 68/263 [50:43<2:23:13, 44.07s/it]
                                                                                                                                                                                              
{'loss': '0.7131', 'grad_norm': '0.1581', 'learning_rate': '0.0001856', 'ppl': '2.04', 'memory/max_active (GiB)': '34.76', 'memory/max_allocated (GiB)': '34.76', 'memory/device_reserved (GiB)': '43.81', 'tokens/train_per_sec_per_gpu': '356.9', 'tokens/trainable': 968461, 'tokens/total': 2106514, 'epoch': '0.2594'}

 26%|███████████████████████████████████████▎                                                                                                              | 69/263 [51:16<2:11:22, 40.63s/it]
                                                                                                                                                                                              
{'loss': '0.7511', 'grad_norm': '0.1736', 'learning_rate': '0.0001849', 'ppl': '2.119', 'memory/max_active (GiB)': '27.61', 'memory/max_allocated (GiB)': '27.61', 'memory/device_reserved (GiB)': '36.19', 'tokens/train_per_sec_per_gpu': '401.6', 'tokens/trainable': 981554, 'tokens/total': 2133242, 'epoch': '0.2632'}

 27%|███████████████████████████████████████▉                                                                                                              | 70/263 [51:47<2:00:59, 37.61s/it]
                                                                                                                                                                                              
{'loss': '0.7844', 'grad_norm': '0.1974', 'learning_rate': '0.0001842', 'ppl': '2.191', 'memory/max_active (GiB)': '22.87', 'memory/max_allocated (GiB)': '22.87', 'memory/device_reserved (GiB)': '35.35', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 990994, 'tokens/total': 2153354, 'epoch': '0.267'}

 27%|████████████████████████████████████████▍                                                                                                             | 71/263 [52:43<2:18:20, 43.23s/it]
                                                                                                                                                                                              
{'loss': '0.6828', 'grad_norm': '0.2258', 'learning_rate': '0.0001835', 'ppl': '1.98', 'memory/max_active (GiB)': '39.9', 'memory/max_allocated (GiB)': '39.9', 'memory/device_reserved (GiB)': '55.24', 'tokens/train_per_sec_per_gpu': '225.3', 'tokens/trainable': 1003686, 'tokens/total': 2182806, 'epoch': '0.2709'}

 27%|█████████████████████████████████████████                                                                                                             | 72/263 [54:06<2:55:59, 55.29s/it]
                                                                                                                                                                                              
{'loss': '0.5772', 'grad_norm': '0.1597', 'learning_rate': '0.0001827', 'ppl': '1.781', 'memory/max_active (GiB)': '59.04', 'memory/max_allocated (GiB)': '59.04', 'memory/device_reserved (GiB)': '76.68', 'tokens/train_per_sec_per_gpu': '300.7', 'tokens/trainable': 1028766, 'tokens/total': 2241490, 'epoch': '0.2747'}

 28%|█████████████████████████████████████████▋                                                                                                            | 73/263 [54:46<2:39:57, 50.51s/it]
                                                                                                                                                                                              
{'loss': '0.6909', 'grad_norm': '0.1693', 'learning_rate': '0.000182', 'ppl': '1.995', 'memory/max_active (GiB)': '42.9', 'memory/max_allocated (GiB)': '42.9', 'memory/device_reserved (GiB)': '59.81', 'tokens/train_per_sec_per_gpu': '349.6', 'tokens/trainable': 1042530, 'tokens/total': 2272086, 'epoch': '0.2785'}

 28%|██████████████████████████████████████████▏                                                                                                           | 74/263 [55:30<2:33:13, 48.65s/it]
                                                                                                                                                                                              
{'loss': '0.7108', 'grad_norm': '0.1959', 'learning_rate': '0.0001812', 'ppl': '2.036', 'memory/max_active (GiB)': '28.32', 'memory/max_allocated (GiB)': '28.32', 'memory/device_reserved (GiB)': '37.09', 'tokens/train_per_sec_per_gpu': '224', 'tokens/trainable': 1052450, 'tokens/total': 2294892, 'epoch': '0.2823'}

 29%|██████████████████████████████████████████▊                                                                                                           | 75/263 [56:22<2:35:57, 49.78s/it]
                                                                                                                                                                                              
{'loss': '0.7308', 'grad_norm': '0.1824', 'learning_rate': '0.0001804', 'ppl': '2.077', 'memory/max_active (GiB)': '25.12', 'memory/max_allocated (GiB)': '25.12', 'memory/device_reserved (GiB)': '32.62', 'tokens/train_per_sec_per_gpu': '252.4', 'tokens/trainable': 1065680, 'tokens/total': 2320714, 'epoch': '0.2861'}

 29%|███████████████████████████████████████████▎                                                                                                          | 76/263 [57:06<2:29:33, 47.99s/it]
                                                                                                                                                                                              
{'loss': '0.6717', 'grad_norm': '0.1747', 'learning_rate': '0.0001796', 'ppl': '1.958', 'memory/max_active (GiB)': '23.01', 'memory/max_allocated (GiB)': '23.01', 'memory/device_reserved (GiB)': '29.15', 'tokens/train_per_sec_per_gpu': '217.3', 'tokens/trainable': 1075199, 'tokens/total': 2342076, 'epoch': '0.2899'}

 29%|███████████████████████████████████████████▉                                                                                                          | 77/263 [57:38<2:13:45, 43.15s/it]
                                                                                                                                                                                              
{'loss': '0.7283', 'grad_norm': '0.1765', 'learning_rate': '0.0001788', 'ppl': '2.071', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '353.1', 'tokens/trainable': 1086448, 'tokens/total': 2367826, 'epoch': '0.2938'}

 30%|████████████████████████████████████████████▍                                                                                                         | 78/263 [58:24<2:15:13, 43.86s/it]
                                                                                                                                                                                              
{'loss': '0.6897', 'grad_norm': '0.133', 'learning_rate': '0.000178', 'ppl': '1.993', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.97', 'tokens/train_per_sec_per_gpu': '393.6', 'tokens/trainable': 1104361, 'tokens/total': 2406764, 'epoch': '0.2976'}

 30%|█████████████████████████████████████████████                                                                                                         | 79/263 [59:16<2:22:50, 46.58s/it]
                                                                                                                                                                                              
{'loss': '0.726', 'grad_norm': '0.1792', 'learning_rate': '0.0001772', 'ppl': '2.067', 'memory/max_active (GiB)': '40.91', 'memory/max_allocated (GiB)': '40.91', 'memory/device_reserved (GiB)': '56.76', 'tokens/train_per_sec_per_gpu': '245.3', 'tokens/trainable': 1117342, 'tokens/total': 2434766, 'epoch': '0.3014'}

 30%|█████████████████████████████████████████████                                                                                                       | 80/263 [1:00:12<2:30:27, 49.33s/it]
                                                                                                                                                                                              
{'loss': '0.7226', 'grad_norm': '0.1773', 'learning_rate': '0.0001763', 'ppl': '2.06', 'memory/max_active (GiB)': '24.17', 'memory/max_allocated (GiB)': '24.17', 'memory/device_reserved (GiB)': '30.76', 'tokens/train_per_sec_per_gpu': '201.2', 'tokens/trainable': 1128559, 'tokens/total': 2459278, 'epoch': '0.3052'}

 31%|█████████████████████████████████████████████▌                                                                                                      | 81/263 [1:00:56<2:24:41, 47.70s/it]
                                                                                                                                                                                              
{'loss': '0.687', 'grad_norm': '0.2009', 'learning_rate': '0.0001755', 'ppl': '1.988', 'memory/max_active (GiB)': '25.85', 'memory/max_allocated (GiB)': '25.85', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '252', 'tokens/trainable': 1139622, 'tokens/total': 2482288, 'epoch': '0.309'}

 31%|██████████████████████████████████████████████▏                                                                                                     | 82/263 [1:01:30<2:11:37, 43.64s/it]
                                                                                                                                                                                              
{'loss': '0.6404', 'grad_norm': '0.1794', 'learning_rate': '0.0001746', 'ppl': '1.897', 'memory/max_active (GiB)': '34.83', 'memory/max_allocated (GiB)': '34.83', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '332.6', 'tokens/trainable': 1150978, 'tokens/total': 2508102, 'epoch': '0.3128'}

 32%|██████████████████████████████████████████████▋                                                                                                     | 83/263 [1:02:17<2:13:57, 44.65s/it]
                                                                                                                                                                                              
{'loss': '0.6434', 'grad_norm': '0.1725', 'learning_rate': '0.0001737', 'ppl': '1.903', 'memory/max_active (GiB)': '34.96', 'memory/max_allocated (GiB)': '34.96', 'memory/device_reserved (GiB)': '43.95', 'tokens/train_per_sec_per_gpu': '264.2', 'tokens/trainable': 1163404, 'tokens/total': 2534552, 'epoch': '0.3166'}

 32%|███████████████████████████████████████████████▎                                                                                                    | 84/263 [1:03:06<2:17:12, 45.99s/it]
                                                                                                                                                                                              
{'loss': '0.6432', 'grad_norm': '0.144', 'learning_rate': '0.0001728', 'ppl': '1.903', 'memory/max_active (GiB)': '24.85', 'memory/max_allocated (GiB)': '24.85', 'memory/device_reserved (GiB)': '32.01', 'tokens/train_per_sec_per_gpu': '300', 'tokens/trainable': 1178135, 'tokens/total': 2562982, 'epoch': '0.3205'}

 32%|███████████████████████████████████████████████▊                                                                                                    | 85/263 [1:03:53<2:16:37, 46.05s/it]
                                                                                                                                                                                              
{'loss': '0.7349', 'grad_norm': '0.2154', 'learning_rate': '0.0001719', 'ppl': '2.085', 'memory/max_active (GiB)': '25.97', 'memory/max_allocated (GiB)': '25.97', 'memory/device_reserved (GiB)': '33.5', 'tokens/train_per_sec_per_gpu': '201.4', 'tokens/trainable': 1187438, 'tokens/total': 2585130, 'epoch': '0.3243'}

 33%|████████████████████████████████████████████████▍                                                                                                   | 86/263 [1:04:34<2:11:44, 44.66s/it]
                                                                                                                                                                                              
{'loss': '0.667', 'grad_norm': '0.1647', 'learning_rate': '0.0001709', 'ppl': '1.948', 'memory/max_active (GiB)': '32.35', 'memory/max_allocated (GiB)': '32.35', 'memory/device_reserved (GiB)': '35.79', 'tokens/train_per_sec_per_gpu': '308.1', 'tokens/trainable': 1200197, 'tokens/total': 2610804, 'epoch': '0.3281'}

 33%|████████████████████████████████████████████████▉                                                                                                   | 87/263 [1:05:11<2:03:50, 42.22s/it]
                                                                                                                                                                                              
{'loss': '0.7045', 'grad_norm': '0.1813', 'learning_rate': '0.00017', 'ppl': '2.023', 'memory/max_active (GiB)': '34.17', 'memory/max_allocated (GiB)': '34.17', 'memory/device_reserved (GiB)': '42.85', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 1211474, 'tokens/total': 2635758, 'epoch': '0.3319'}

 33%|█████████████████████████████████████████████████▌                                                                                                  | 88/263 [1:06:03<2:12:03, 45.28s/it]
                                                                                                                                                                                              
{'loss': '0.7118', 'grad_norm': '0.1373', 'learning_rate': '0.0001691', 'ppl': '2.038', 'memory/max_active (GiB)': '49.54', 'memory/max_allocated (GiB)': '49.54', 'memory/device_reserved (GiB)': '69.87', 'tokens/train_per_sec_per_gpu': '390', 'tokens/trainable': 1231915, 'tokens/total': 2679352, 'epoch': '0.3357'}

 34%|██████████████████████████████████████████████████                                                                                                  | 89/263 [1:06:36<2:00:21, 41.50s/it]
                                                                                                                                                                                              
{'loss': '0.7212', 'grad_norm': '0.1959', 'learning_rate': '0.0001681', 'ppl': '2.057', 'memory/max_active (GiB)': '26.55', 'memory/max_allocated (GiB)': '26.55', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '346.5', 'tokens/trainable': 1243243, 'tokens/total': 2703340, 'epoch': '0.3395'}

 34%|██████████████████████████████████████████████████▋                                                                                                 | 90/263 [1:07:06<1:50:05, 38.18s/it]
                                                                                                                                                                                              
{'loss': '0.7258', 'grad_norm': '0.2195', 'learning_rate': '0.0001671', 'ppl': '2.066', 'memory/max_active (GiB)': '30.13', 'memory/max_allocated (GiB)': '30.13', 'memory/device_reserved (GiB)': '39.89', 'tokens/train_per_sec_per_gpu': '316.4', 'tokens/trainable': 1252869, 'tokens/total': 2726904, 'epoch': '0.3433'}

 35%|███████████████████████████████████████████████████▏                                                                                                | 91/263 [1:07:50<1:54:23, 39.90s/it]
                                                                                                                                                                                              
{'loss': '0.7356', 'grad_norm': '0.1998', 'learning_rate': '0.0001661', 'ppl': '2.087', 'memory/max_active (GiB)': '36.96', 'memory/max_allocated (GiB)': '36.96', 'memory/device_reserved (GiB)': '46.66', 'tokens/train_per_sec_per_gpu': '350.6', 'tokens/trainable': 1268267, 'tokens/total': 2760058, 'epoch': '0.3472'}

 35%|███████████████████████████████████████████████████▊                                                                                                | 92/263 [1:08:35<1:57:53, 41.37s/it]
                                                                                                                                                                                              
{'loss': '0.6534', 'grad_norm': '0.1845', 'learning_rate': '0.0001651', 'ppl': '1.922', 'memory/max_active (GiB)': '35.55', 'memory/max_allocated (GiB)': '35.55', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '339.5', 'tokens/trainable': 1283469, 'tokens/total': 2794120, 'epoch': '0.351'}

 35%|████████████████████████████████████████████████████▎                                                                                               | 93/263 [1:09:07<1:49:00, 38.47s/it]
                                                                                                                                                                                              
{'loss': '0.7409', 'grad_norm': '0.1926', 'learning_rate': '0.0001641', 'ppl': '2.098', 'memory/max_active (GiB)': '23.77', 'memory/max_allocated (GiB)': '23.77', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '334.8', 'tokens/trainable': 1294090, 'tokens/total': 2818588, 'epoch': '0.3548'}

 36%|████████████████████████████████████████████████████▉                                                                                               | 94/263 [1:09:49<1:51:37, 39.63s/it]
                                                                                                                                                                                              
{'loss': '0.6379', 'grad_norm': '0.1578', 'learning_rate': '0.0001631', 'ppl': '1.892', 'memory/max_active (GiB)': '34.71', 'memory/max_allocated (GiB)': '34.71', 'memory/device_reserved (GiB)': '43.64', 'tokens/train_per_sec_per_gpu': '369.7', 'tokens/trainable': 1309742, 'tokens/total': 2851686, 'epoch': '0.3586'}

 36%|█████████████████████████████████████████████████████▍                                                                                              | 95/263 [1:10:26<1:49:15, 39.02s/it]
                                                                                                                                                                                              
{'loss': '0.7076', 'grad_norm': '0.1786', 'learning_rate': '0.0001621', 'ppl': '2.029', 'memory/max_active (GiB)': '33.53', 'memory/max_allocated (GiB)': '33.53', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '353.4', 'tokens/trainable': 1323027, 'tokens/total': 2880764, 'epoch': '0.3624'}

 37%|██████████████████████████████████████████████████████                                                                                              | 96/263 [1:10:59<1:42:48, 36.94s/it]
                                                                                                                                                                                              
{'loss': '0.6912', 'grad_norm': '0.1893', 'learning_rate': '0.000161', 'ppl': '1.996', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '44.59', 'tokens/train_per_sec_per_gpu': '352.2', 'tokens/trainable': 1334326, 'tokens/total': 2905950, 'epoch': '0.3662'}

 37%|██████████████████████████████████████████████████████▌                                                                                             | 97/263 [1:11:43<1:48:20, 39.16s/it]
                                                                                                                                                                                              
{'loss': '0.6708', 'grad_norm': '0.173', 'learning_rate': '0.00016', 'ppl': '1.956', 'memory/max_active (GiB)': '47.2', 'memory/max_allocated (GiB)': '47.2', 'memory/device_reserved (GiB)': '66.49', 'tokens/train_per_sec_per_gpu': '351.1', 'tokens/trainable': 1349894, 'tokens/total': 2942380, 'epoch': '0.3701'}

 37%|███████████████████████████████████████████████████████▏                                                                                            | 98/263 [1:12:38<2:00:32, 43.83s/it]
                                                                                                                                                                                              
{'loss': '0.685', 'grad_norm': '0.1576', 'learning_rate': '0.0001589', 'ppl': '1.984', 'memory/max_active (GiB)': '37.19', 'memory/max_allocated (GiB)': '37.19', 'memory/device_reserved (GiB)': '47.07', 'tokens/train_per_sec_per_gpu': '357.9', 'tokens/trainable': 1369481, 'tokens/total': 2988904, 'epoch': '0.3739'}

 38%|███████████████████████████████████████████████████████▋                                                                                            | 99/263 [1:13:21<1:59:39, 43.77s/it]
                                                                                                                                                                                              
{'loss': '0.6715', 'grad_norm': '0.1494', 'learning_rate': '0.0001578', 'ppl': '1.957', 'memory/max_active (GiB)': '42.43', 'memory/max_allocated (GiB)': '42.43', 'memory/device_reserved (GiB)': '59.1', 'tokens/train_per_sec_per_gpu': '384.8', 'tokens/trainable': 1386274, 'tokens/total': 3026814, 'epoch': '0.3777'}

 38%|███████████████████████████████████████████████████████▉                                                                                           | 100/263 [1:13:55<1:51:09, 40.92s/it]
                                                                                                                                                                                              
{'loss': '0.6481', 'grad_norm': '0.208', 'learning_rate': '0.0001567', 'ppl': '1.912', 'memory/max_active (GiB)': '26.51', 'memory/max_allocated (GiB)': '26.51', 'memory/device_reserved (GiB)': '34.38', 'tokens/train_per_sec_per_gpu': '296.5', 'tokens/trainable': 1396430, 'tokens/total': 3050820, 'epoch': '0.3815'}

 38%|████████████████████████████████████████████████████████▍                                                                                          | 101/263 [1:14:53<2:04:04, 45.96s/it]
                                                                                                                                                                                              
{'loss': '0.7205', 'grad_norm': '0.1546', 'learning_rate': '0.0001556', 'ppl': '2.056', 'memory/max_active (GiB)': '44.95', 'memory/max_allocated (GiB)': '44.95', 'memory/device_reserved (GiB)': '62.88', 'tokens/train_per_sec_per_gpu': '350', 'tokens/trainable': 1416631, 'tokens/total': 3099000, 'epoch': '0.3853'}

 39%|█████████████████████████████████████████████████████████                                                                                          | 102/263 [1:15:38<2:02:20, 45.60s/it]
                                                                                                                                                                                              
{'loss': '0.7284', 'grad_norm': '0.1686', 'learning_rate': '0.0001545', 'ppl': '2.072', 'memory/max_active (GiB)': '32', 'memory/max_allocated (GiB)': '32', 'memory/device_reserved (GiB)': '42.76', 'tokens/train_per_sec_per_gpu': '367.9', 'tokens/trainable': 1433096, 'tokens/total': 3134960, 'epoch': '0.3891'}

 39%|█████████████████████████████████████████████████████████▌                                                                                         | 103/263 [1:16:06<1:47:33, 40.33s/it]
                                                                                                                                                                                              
{'loss': '0.6715', 'grad_norm': '0.1936', 'learning_rate': '0.0001534', 'ppl': '1.957', 'memory/max_active (GiB)': '26.34', 'memory/max_allocated (GiB)': '26.34', 'memory/device_reserved (GiB)': '34.07', 'tokens/train_per_sec_per_gpu': '351.9', 'tokens/trainable': 1442967, 'tokens/total': 3156628, 'epoch': '0.3929'}

 40%|██████████████████████████████████████████████████████████▏                                                                                        | 104/263 [1:16:41<1:42:26, 38.66s/it]
                                                                                                                                                                                              
{'loss': '0.5962', 'grad_norm': '0.1888', 'learning_rate': '0.0001523', 'ppl': '1.815', 'memory/max_active (GiB)': '28.66', 'memory/max_allocated (GiB)': '28.66', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 1453178, 'tokens/total': 3180850, 'epoch': '0.3968'}

 40%|██████████████████████████████████████████████████████████▋                                                                                        | 105/263 [1:17:11<1:35:30, 36.27s/it]
                                                                                                                                                                                              
{'loss': '0.6637', 'grad_norm': '0.1802', 'learning_rate': '0.0001511', 'ppl': '1.942', 'memory/max_active (GiB)': '28.04', 'memory/max_allocated (GiB)': '28.04', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '375', 'tokens/trainable': 1464686, 'tokens/total': 3203888, 'epoch': '0.4006'}

 40%|███████████████████████████████████████████████████████████▏                                                                                       | 106/263 [1:17:42<1:30:16, 34.50s/it]
                                                                                                                                                                                              
{'loss': '0.7321', 'grad_norm': '0.2234', 'learning_rate': '0.00015', 'ppl': '2.079', 'memory/max_active (GiB)': '21.56', 'memory/max_allocated (GiB)': '21.56', 'memory/device_reserved (GiB)': '28.04', 'tokens/train_per_sec_per_gpu': '313.8', 'tokens/trainable': 1474221, 'tokens/total': 3224060, 'epoch': '0.4044'}

 41%|███████████████████████████████████████████████████████████▊                                                                                       | 107/263 [1:18:34<1:43:12, 39.70s/it]
                                                                                                                                                                                              
{'loss': '0.5681', 'grad_norm': '0.162', 'learning_rate': '0.0001488', 'ppl': '1.765', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '53.23', 'tokens/train_per_sec_per_gpu': '297.5', 'tokens/trainable': 1489638, 'tokens/total': 3260576, 'epoch': '0.4082'}

 41%|████████████████████████████████████████████████████████████▎                                                                                      | 108/263 [1:19:20<1:47:51, 41.75s/it]
                                                                                                                                                                                              
{'loss': '0.708', 'grad_norm': '0.197', 'learning_rate': '0.0001477', 'ppl': '2.03', 'memory/max_active (GiB)': '29.18', 'memory/max_allocated (GiB)': '29.18', 'memory/device_reserved (GiB)': '38.48', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 1502425, 'tokens/total': 3287854, 'epoch': '0.412'}

 41%|████████████████████████████████████████████████████████████▉                                                                                      | 109/263 [1:20:18<1:59:25, 46.53s/it]
                                                                                                                                                                                              
{'loss': '0.6689', 'grad_norm': '0.1784', 'learning_rate': '0.0001465', 'ppl': '1.952', 'memory/max_active (GiB)': '38.9', 'memory/max_allocated (GiB)': '38.9', 'memory/device_reserved (GiB)': '53.77', 'tokens/train_per_sec_per_gpu': '320.6', 'tokens/trainable': 1520916, 'tokens/total': 3326198, 'epoch': '0.4158'}

 42%|█████████████████████████████████████████████████████████████▍                                                                                     | 110/263 [1:21:01<1:56:00, 45.49s/it]
                                                                                                                                                                                              
{'loss': '0.6355', 'grad_norm': '0.1931', 'learning_rate': '0.0001453', 'ppl': '1.888', 'memory/max_active (GiB)': '28.51', 'memory/max_allocated (GiB)': '28.51', 'memory/device_reserved (GiB)': '43.21', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 1532644, 'tokens/total': 3353708, 'epoch': '0.4196'}

 42%|██████████████████████████████████████████████████████████████                                                                                     | 111/263 [1:21:37<1:48:21, 42.77s/it]
                                                                                                                                                                                              
{'loss': '0.6837', 'grad_norm': '0.1755', 'learning_rate': '0.0001442', 'ppl': '1.981', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.12', 'tokens/train_per_sec_per_gpu': '395.7', 'tokens/trainable': 1547058, 'tokens/total': 3383196, 'epoch': '0.4235'}

 43%|██████████████████████████████████████████████████████████████▌                                                                                    | 112/263 [1:22:31<1:55:30, 45.90s/it]
                                                                                                                                                                                              
{'loss': '0.6215', 'grad_norm': '0.2103', 'learning_rate': '0.000143', 'ppl': '1.862', 'memory/max_active (GiB)': '40.6', 'memory/max_allocated (GiB)': '40.6', 'memory/device_reserved (GiB)': '56.43', 'tokens/train_per_sec_per_gpu': '248.4', 'tokens/trainable': 1560272, 'tokens/total': 3415688, 'epoch': '0.4273'}

 43%|███████████████████████████████████████████████████████████████▏                                                                                   | 113/263 [1:23:20<1:57:39, 47.06s/it]
                                                                                                                                                                                              
{'loss': '0.6699', 'grad_norm': '0.1813', 'learning_rate': '0.0001418', 'ppl': '1.954', 'memory/max_active (GiB)': '37.04', 'memory/max_allocated (GiB)': '37.04', 'memory/device_reserved (GiB)': '46.8', 'tokens/train_per_sec_per_gpu': '282.5', 'tokens/trainable': 1574335, 'tokens/total': 3444166, 'epoch': '0.4311'}

 43%|███████████████████████████████████████████████████████████████▋                                                                                   | 114/263 [1:24:12<2:00:02, 48.34s/it]
                                                                                                                                                                                              
{'loss': '0.594', 'grad_norm': '0.1998', 'learning_rate': '0.0001406', 'ppl': '1.811', 'memory/max_active (GiB)': '36.6', 'memory/max_allocated (GiB)': '36.6', 'memory/device_reserved (GiB)': '46.27', 'tokens/train_per_sec_per_gpu': '276.6', 'tokens/trainable': 1588531, 'tokens/total': 3472530, 'epoch': '0.4349'}

 44%|████████████████████████████████████████████████████████████████▎                                                                                  | 115/263 [1:25:04<2:01:51, 49.40s/it]
                                                                                                                                                                                              
{'loss': '0.7205', 'grad_norm': '0.1913', 'learning_rate': '0.0001393', 'ppl': '2.055', 'memory/max_active (GiB)': '37.17', 'memory/max_allocated (GiB)': '37.17', 'memory/device_reserved (GiB)': '47.05', 'tokens/train_per_sec_per_gpu': '279.3', 'tokens/trainable': 1603020, 'tokens/total': 3503644, 'epoch': '0.4387'}

 44%|████████████████████████████████████████████████████████████████▊                                                                                  | 116/263 [1:25:48<1:57:32, 47.97s/it]
                                                                                                                                                                                              
{'loss': '0.7426', 'grad_norm': '0.2782', 'learning_rate': '0.0001381', 'ppl': '2.101', 'memory/max_active (GiB)': '40', 'memory/max_allocated (GiB)': '40', 'memory/device_reserved (GiB)': '55.41', 'tokens/train_per_sec_per_gpu': '231.5', 'tokens/trainable': 1613354, 'tokens/total': 3529882, 'epoch': '0.4425'}

 44%|█████████████████████████████████████████████████████████████████▍                                                                                 | 117/263 [1:26:52<2:08:08, 52.66s/it]
                                                                                                                                                                                              
{'loss': '0.6484', 'grad_norm': '0.1659', 'learning_rate': '0.0001369', 'ppl': '1.913', 'memory/max_active (GiB)': '54.45', 'memory/max_allocated (GiB)': '54.45', 'memory/device_reserved (GiB)': '77.33', 'tokens/train_per_sec_per_gpu': '272', 'tokens/trainable': 1630658, 'tokens/total': 3566156, 'epoch': '0.4464'}

 45%|█████████████████████████████████████████████████████████████████▉                                                                                 | 118/263 [1:27:52<2:12:36, 54.87s/it]
                                                                                                                                                                                              
{'loss': '0.611', 'grad_norm': '0.1486', 'learning_rate': '0.0001357', 'ppl': '1.842', 'memory/max_active (GiB)': '43.75', 'memory/max_allocated (GiB)': '43.75', 'memory/device_reserved (GiB)': '61.14', 'tokens/train_per_sec_per_gpu': '329.4', 'tokens/trainable': 1650425, 'tokens/total': 3606182, 'epoch': '0.4502'}

 45%|██████████████████████████████████████████████████████████████████▌                                                                                | 119/263 [1:28:50<2:13:58, 55.82s/it]
                                                                                                                                                                                              
{'loss': '0.6299', 'grad_norm': '0.1702', 'learning_rate': '0.0001344', 'ppl': '1.877', 'memory/max_active (GiB)': '60.94', 'memory/max_allocated (GiB)': '60.94', 'memory/device_reserved (GiB)': '69.08', 'tokens/train_per_sec_per_gpu': '289.6', 'tokens/trainable': 1667235, 'tokens/total': 3645356, 'epoch': '0.454'}

 46%|███████████████████████████████████████████████████████████████████                                                                                | 120/263 [1:29:35<2:05:41, 52.74s/it]
                                                                                                                                                                                              
{'loss': '0.7218', 'grad_norm': '0.1997', 'learning_rate': '0.0001332', 'ppl': '2.058', 'memory/max_active (GiB)': '37.26', 'memory/max_allocated (GiB)': '37.26', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '310.7', 'tokens/trainable': 1681386, 'tokens/total': 3674268, 'epoch': '0.4578'}

 46%|███████████████████████████████████████████████████████████████████▋                                                                               | 121/263 [1:30:33<2:08:14, 54.19s/it]
                                                                                                                                                                                              
{'loss': '0.6603', 'grad_norm': '0.2087', 'learning_rate': '0.0001319', 'ppl': '1.935', 'memory/max_active (GiB)': '42.12', 'memory/max_allocated (GiB)': '42.12', 'memory/device_reserved (GiB)': '58.63', 'tokens/train_per_sec_per_gpu': '268.9', 'tokens/trainable': 1696869, 'tokens/total': 3710714, 'epoch': '0.4616'}

 46%|████████████████████████████████████████████████████████████████████▏                                                                              | 122/263 [1:31:17<2:00:10, 51.14s/it]
                                                                                                                                                                                              
{'loss': '0.7366', 'grad_norm': '0.2266', 'learning_rate': '0.0001306', 'ppl': '2.089', 'memory/max_active (GiB)': '23.04', 'memory/max_allocated (GiB)': '23.04', 'memory/device_reserved (GiB)': '45.24', 'tokens/train_per_sec_per_gpu': '206.1', 'tokens/trainable': 1705941, 'tokens/total': 3730260, 'epoch': '0.4654'}

 47%|████████████████████████████████████████████████████████████████████▋                                                                              | 123/263 [1:32:05<1:57:21, 50.29s/it]
                                                                                                                                                                                              
{'loss': '0.7187', 'grad_norm': '0.164', 'learning_rate': '0.0001294', 'ppl': '2.052', 'memory/max_active (GiB)': '31.57', 'memory/max_allocated (GiB)': '31.57', 'memory/device_reserved (GiB)': '41.88', 'tokens/train_per_sec_per_gpu': '350.2', 'tokens/trainable': 1722869, 'tokens/total': 3764600, 'epoch': '0.4692'}

 47%|█████████████████████████████████████████████████████████████████████▎                                                                             | 124/263 [1:32:54<1:55:10, 49.72s/it]
                                                                                                                                                                                              
{'loss': '0.7666', 'grad_norm': '0.2024', 'learning_rate': '0.0001281', 'ppl': '2.152', 'memory/max_active (GiB)': '28.74', 'memory/max_allocated (GiB)': '28.74', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '259.1', 'tokens/trainable': 1735401, 'tokens/total': 3789866, 'epoch': '0.4731'}

 48%|█████████████████████████████████████████████████████████████████████▊                                                                             | 125/263 [1:33:56<2:03:06, 53.52s/it]
                                                                                                                                                                                              
{'loss': '0.6747', 'grad_norm': '0.1734', 'learning_rate': '0.0001268', 'ppl': '1.963', 'memory/max_active (GiB)': '36.35', 'memory/max_allocated (GiB)': '36.35', 'memory/device_reserved (GiB)': '45.94', 'tokens/train_per_sec_per_gpu': '266.3', 'tokens/trainable': 1752018, 'tokens/total': 3826538, 'epoch': '0.4769'}

 48%|██████████████████████████████████████████████████████████████████████▍                                                                            | 126/263 [1:34:58<2:07:43, 55.94s/it]
                                                                                                                                                                                              
{'loss': '0.7026', 'grad_norm': '0.1709', 'learning_rate': '0.0001256', 'ppl': '2.019', 'memory/max_active (GiB)': '34.18', 'memory/max_allocated (GiB)': '34.18', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '257.9', 'tokens/trainable': 1767897, 'tokens/total': 3861520, 'epoch': '0.4807'}

 48%|██████████████████████████████████████████████████████████████████████▉                                                                            | 127/263 [1:35:43<1:59:53, 52.89s/it]
                                                                                                                                                                                              
{'loss': '0.7321', 'grad_norm': '0.2378', 'learning_rate': '0.0001243', 'ppl': '2.079', 'memory/max_active (GiB)': '26.64', 'memory/max_allocated (GiB)': '26.64', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '211', 'tokens/trainable': 1777560, 'tokens/total': 3883630, 'epoch': '0.4845'}

 49%|███████████████████████████████████████████████████████████████████████▌                                                                           | 128/263 [1:36:21<1:48:48, 48.36s/it]
                                                                                                                                                                                              
{'loss': '0.6494', 'grad_norm': '0.1878', 'learning_rate': '0.000123', 'ppl': '1.914', 'memory/max_active (GiB)': '23.99', 'memory/max_allocated (GiB)': '23.99', 'memory/device_reserved (GiB)': '31.15', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 1789026, 'tokens/total': 3907826, 'epoch': '0.4883'}

 49%|████████████████████████████████████████████████████████████████████████                                                                           | 129/263 [1:37:18<1:53:46, 50.95s/it]
                                                                                                                                                                                              
{'loss': '0.6286', 'grad_norm': '0.1865', 'learning_rate': '0.0001217', 'ppl': '1.875', 'memory/max_active (GiB)': '34.03', 'memory/max_allocated (GiB)': '34.03', 'memory/device_reserved (GiB)': '42.72', 'tokens/train_per_sec_per_gpu': '281.1', 'tokens/trainable': 1805039, 'tokens/total': 3942120, 'epoch': '0.4921'}

 49%|████████████████████████████████████████████████████████████████████████▋                                                                          | 130/263 [1:38:19<1:59:46, 54.03s/it]
                                                                                                                                                                                              
{'loss': '0.5905', 'grad_norm': '0.1831', 'learning_rate': '0.0001204', 'ppl': '1.805', 'memory/max_active (GiB)': '40.32', 'memory/max_allocated (GiB)': '40.32', 'memory/device_reserved (GiB)': '55.9', 'tokens/train_per_sec_per_gpu': '282.4', 'tokens/trainable': 1822330, 'tokens/total': 3976238, 'epoch': '0.4959'}

 50%|█████████████████████████████████████████████████████████████████████████▏                                                                         | 131/263 [1:39:03<1:52:00, 50.91s/it]
                                                                                                                                                                                              
{'loss': '0.6443', 'grad_norm': '0.1812', 'learning_rate': '0.0001191', 'ppl': '1.905', 'memory/max_active (GiB)': '33.09', 'memory/max_allocated (GiB)': '33.09', 'memory/device_reserved (GiB)': '41.43', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 1835502, 'tokens/total': 4004024, 'epoch': '0.4998'}

 50%|█████████████████████████████████████████████████████████████████████████▊                                                                         | 132/263 [1:40:07<1:59:46, 54.86s/it]
                                                                                                                                                                                              
{'loss': '0.6177', 'grad_norm': '0.158', 'learning_rate': '0.0001178', 'ppl': '1.855', 'memory/max_active (GiB)': '44.4', 'memory/max_allocated (GiB)': '44.4', 'memory/device_reserved (GiB)': '62.21', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 1854705, 'tokens/total': 4047366, 'epoch': '0.5036'}

 51%|██████████████████████████████████████████████████████████████████████████▎                                                                        | 133/263 [1:41:01<1:58:27, 54.67s/it]
                                                                                                                                                                                              
{'loss': '0.6597', 'grad_norm': '0.197', 'learning_rate': '0.0001165', 'ppl': '1.934', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '225.4', 'tokens/trainable': 1866931, 'tokens/total': 4073028, 'epoch': '0.5074'}

 51%|██████████████████████████████████████████████████████████████████████████▉                                                                        | 134/263 [1:42:00<2:00:25, 56.01s/it]
                                                                                                                                                                                              
{'loss': '0.6264', 'grad_norm': '0.1803', 'learning_rate': '0.0001152', 'ppl': '1.871', 'memory/max_active (GiB)': '35.15', 'memory/max_allocated (GiB)': '35.15', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '217.6', 'tokens/trainable': 1879795, 'tokens/total': 4103326, 'epoch': '0.5112'}

 51%|██████████████████████████████████████████████████████████████████████████▉                                                                        | 134/263 [1:42:01<2:00:25, 56.01s/it]
 51%|███████████████████████████████████████████████████████████████████████████▍                                                                       | 135/263 [1:43:08<2:06:45, 59.42s/it]
                                                                                                                                                                                              
{'loss': '0.5894', 'grad_norm': '0.1796', 'learning_rate': '0.0001139', 'ppl': '1.803', 'memory/max_active (GiB)': '60.2', 'memory/max_allocated (GiB)': '60.2', 'memory/device_reserved (GiB)': '78.05', 'tokens/train_per_sec_per_gpu': '280.6', 'tokens/trainable': 1898706, 'tokens/total': 4145556, 'epoch': '0.515'}

 52%|████████████████████████████████████████████████████████████████████████████                                                                       | 136/263 [1:44:02<2:02:32, 57.90s/it]
                                                                                                                                                                                              
{'loss': '0.6652', 'grad_norm': '0.1649', 'learning_rate': '0.0001126', 'ppl': '1.945', 'memory/max_active (GiB)': '36.24', 'memory/max_allocated (GiB)': '36.24', 'memory/device_reserved (GiB)': '45.69', 'tokens/train_per_sec_per_gpu': '253.7', 'tokens/trainable': 1912491, 'tokens/total': 4178912, 'epoch': '0.5188'}

 52%|████████████████████████████████████████████████████████████████████████████▌                                                                      | 137/263 [1:45:08<2:06:33, 60.27s/it]
                                                                                                                                                                                              
{'loss': '0.6049', 'grad_norm': '0.1469', 'learning_rate': '0.0001112', 'ppl': '1.831', 'memory/max_active (GiB)': '54.11', 'memory/max_allocated (GiB)': '54.11', 'memory/device_reserved (GiB)': '77.07', 'tokens/train_per_sec_per_gpu': '291.8', 'tokens/trainable': 1931691, 'tokens/total': 4219728, 'epoch': '0.5227'}

 52%|█████████████████████████████████████████████████████████████████████████████▏                                                                     | 138/263 [1:45:55<1:57:30, 56.40s/it]
                                                                                                                                                                                              
{'loss': '0.7114', 'grad_norm': '0.219', 'learning_rate': '0.0001099', 'ppl': '2.037', 'memory/max_active (GiB)': '29.78', 'memory/max_allocated (GiB)': '29.78', 'memory/device_reserved (GiB)': '39.28', 'tokens/train_per_sec_per_gpu': '230.8', 'tokens/trainable': 1942627, 'tokens/total': 4246086, 'epoch': '0.5265'}

 53%|█████████████████████████████████████████████████████████████████████████████▋                                                                     | 139/263 [1:46:33<1:44:54, 50.76s/it]
                                                                                                                                                                                              
{'loss': '0.6658', 'grad_norm': '0.2107', 'learning_rate': '0.0001086', 'ppl': '1.946', 'memory/max_active (GiB)': '32.15', 'memory/max_allocated (GiB)': '32.15', 'memory/device_reserved (GiB)': '35.63', 'tokens/train_per_sec_per_gpu': '371.6', 'tokens/trainable': 1956598, 'tokens/total': 4274466, 'epoch': '0.5303'}

 53%|██████████████████████████████████████████████████████████████████████████████▎                                                                    | 140/263 [1:47:10<1:35:26, 46.56s/it]
                                                                                                                                                                                              
{'loss': '0.6586', 'grad_norm': '0.1947', 'learning_rate': '0.0001073', 'ppl': '1.932', 'memory/max_active (GiB)': '29.82', 'memory/max_allocated (GiB)': '29.82', 'memory/device_reserved (GiB)': '39.44', 'tokens/train_per_sec_per_gpu': '341.7', 'tokens/trainable': 1969155, 'tokens/total': 4302324, 'epoch': '0.5341'}

 54%|██████████████████████████████████████████████████████████████████████████████▊                                                                    | 141/263 [1:47:52<1:32:13, 45.36s/it]
                                                                                                                                                                                              
{'loss': '0.7005', 'grad_norm': '0.2028', 'learning_rate': '0.000106', 'ppl': '2.015', 'memory/max_active (GiB)': '30.7', 'memory/max_allocated (GiB)': '30.7', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '272.9', 'tokens/trainable': 1980769, 'tokens/total': 4330164, 'epoch': '0.5379'}

 54%|███████████████████████████████████████████████████████████████████████████████▎                                                                   | 142/263 [1:48:40<1:33:08, 46.19s/it]
                                                                                                                                                                                              
{'loss': '0.6936', 'grad_norm': '0.207', 'learning_rate': '0.0001046', 'ppl': '2.001', 'memory/max_active (GiB)': '30.38', 'memory/max_allocated (GiB)': '30.38', 'memory/device_reserved (GiB)': '40.1', 'tokens/train_per_sec_per_gpu': '213.5', 'tokens/trainable': 1991044, 'tokens/total': 4352668, 'epoch': '0.5417'}

 54%|███████████████████████████████████████████████████████████████████████████████▉                                                                   | 143/263 [1:49:28<1:32:57, 46.48s/it]
                                                                                                                                                                                              
{'loss': '0.7092', 'grad_norm': '0.1816', 'learning_rate': '0.0001033', 'ppl': '2.032', 'memory/max_active (GiB)': '33.08', 'memory/max_allocated (GiB)': '33.08', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '325.4', 'tokens/trainable': 2006393, 'tokens/total': 4386038, 'epoch': '0.5455'}

 55%|████████████████████████████████████████████████████████████████████████████████▍                                                                  | 144/263 [1:49:54<1:20:19, 40.50s/it]
                                                                                                                                                                                              
{'loss': '0.6535', 'grad_norm': '0.2587', 'learning_rate': '0.000102', 'ppl': '1.922', 'memory/max_active (GiB)': '24', 'memory/max_allocated (GiB)': '24', 'memory/device_reserved (GiB)': '30.59', 'tokens/train_per_sec_per_gpu': '326.8', 'tokens/trainable': 2015072, 'tokens/total': 4404946, 'epoch': '0.5494'}

 55%|█████████████████████████████████████████████████████████████████████████████████                                                                  | 145/263 [1:50:25<1:13:46, 37.51s/it]
                                                                                                                                                                                              
{'loss': '0.6609', 'grad_norm': '0.2187', 'learning_rate': '0.0001007', 'ppl': '1.937', 'memory/max_active (GiB)': '24.8', 'memory/max_allocated (GiB)': '24.8', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '344.2', 'tokens/trainable': 2025583, 'tokens/total': 4426472, 'epoch': '0.5532'}

 56%|█████████████████████████████████████████████████████████████████████████████████▌                                                                 | 146/263 [1:51:10<1:17:32, 39.76s/it]
                                                                                                                                                                                              
{'loss': '0.6478', 'grad_norm': '0.2216', 'learning_rate': '9.934e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '28.07', 'memory/max_allocated (GiB)': '28.07', 'memory/device_reserved (GiB)': '36.78', 'tokens/train_per_sec_per_gpu': '212.7', 'tokens/trainable': 2035157, 'tokens/total': 4447532, 'epoch': '0.557'}

 56%|██████████████████████████████████████████████████████████████████████████████████▏                                                                | 147/263 [1:51:59<1:22:08, 42.49s/it]
                                                                                                                                                                                              
{'loss': '0.6784', 'grad_norm': '0.2246', 'learning_rate': '9.801e-05', 'ppl': '1.971', 'memory/max_active (GiB)': '24.88', 'memory/max_allocated (GiB)': '24.88', 'memory/device_reserved (GiB)': '31.92', 'tokens/train_per_sec_per_gpu': '225.8', 'tokens/trainable': 2046191, 'tokens/total': 4469110, 'epoch': '0.5608'}

 56%|██████████████████████████████████████████████████████████████████████████████████▋                                                                | 148/263 [1:52:52<1:27:37, 45.72s/it]
                                                                                                                                                                                              
{'loss': '0.6373', 'grad_norm': '0.1724', 'learning_rate': '9.669e-05', 'ppl': '1.891', 'memory/max_active (GiB)': '44.04', 'memory/max_allocated (GiB)': '44.04', 'memory/device_reserved (GiB)': '61.51', 'tokens/train_per_sec_per_gpu': '327.3', 'tokens/trainable': 2063617, 'tokens/total': 4505868, 'epoch': '0.5646'}

 57%|███████████████████████████████████████████████████████████████████████████████████▎                                                               | 149/263 [1:53:39<1:27:45, 46.19s/it]
                                                                                                                                                                                              
{'loss': '0.5898', 'grad_norm': '0.1733', 'learning_rate': '9.536e-05', 'ppl': '1.804', 'memory/max_active (GiB)': '31.09', 'memory/max_allocated (GiB)': '31.09', 'memory/device_reserved (GiB)': '37.29', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 2078116, 'tokens/total': 4537032, 'epoch': '0.5684'}

 57%|███████████████████████████████████████████████████████████████████████████████████▊                                                               | 150/263 [1:54:25<1:26:36, 45.98s/it]
                                                                                                                                                                                              
{'loss': '0.5857', 'grad_norm': '0.199', 'learning_rate': '9.404e-05', 'ppl': '1.796', 'memory/max_active (GiB)': '36.3', 'memory/max_allocated (GiB)': '36.3', 'memory/device_reserved (GiB)': '45.92', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 2089719, 'tokens/total': 4564268, 'epoch': '0.5722'}

 57%|████████████████████████████████████████████████████████████████████████████████████▍                                                              | 151/263 [1:55:16<1:29:04, 47.72s/it]
                                                                                                                                                                                              
{'loss': '0.6078', 'grad_norm': '0.2164', 'learning_rate': '9.272e-05', 'ppl': '1.836', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '259.3', 'tokens/trainable': 2103145, 'tokens/total': 4593966, 'epoch': '0.5761'}

 58%|████████████████████████████████████████████████████████████████████████████████████▉                                                              | 152/263 [1:56:03<1:27:31, 47.31s/it]
                                                                                                                                                                                              
{'loss': '0.7859', 'grad_norm': '0.2286', 'learning_rate': '9.139e-05', 'ppl': '2.194', 'memory/max_active (GiB)': '25.58', 'memory/max_allocated (GiB)': '25.58', 'memory/device_reserved (GiB)': '33.29', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 2113624, 'tokens/total': 4616602, 'epoch': '0.5799'}

 58%|█████████████████████████████████████████████████████████████████████████████████████▌                                                             | 153/263 [1:56:39<1:20:33, 43.94s/it]
                                                                                                                                                                                              
{'loss': '0.6501', 'grad_norm': '0.2131', 'learning_rate': '9.007e-05', 'ppl': '1.916', 'memory/max_active (GiB)': '23.47', 'memory/max_allocated (GiB)': '23.47', 'memory/device_reserved (GiB)': '29.78', 'tokens/train_per_sec_per_gpu': '255.8', 'tokens/trainable': 2122849, 'tokens/total': 4636788, 'epoch': '0.5837'}

 59%|██████████████████████████████████████████████████████████████████████████████████████                                                             | 154/263 [1:57:30<1:23:40, 46.06s/it]
                                                                                                                                                                                              
{'loss': '0.7322', 'grad_norm': '0.2115', 'learning_rate': '8.876e-05', 'ppl': '2.08', 'memory/max_active (GiB)': '38.63', 'memory/max_allocated (GiB)': '38.63', 'memory/device_reserved (GiB)': '53.38', 'tokens/train_per_sec_per_gpu': '324', 'tokens/trainable': 2139377, 'tokens/total': 4672022, 'epoch': '0.5875'}

 59%|██████████████████████████████████████████████████████████████████████████████████████▋                                                            | 155/263 [1:58:23<1:26:31, 48.07s/it]
                                                                                                                                                                                              
{'loss': '0.6393', 'grad_norm': '0.1979', 'learning_rate': '8.744e-05', 'ppl': '1.895', 'memory/max_active (GiB)': '39.73', 'memory/max_allocated (GiB)': '39.73', 'memory/device_reserved (GiB)': '54.95', 'tokens/train_per_sec_per_gpu': '278.7', 'tokens/trainable': 2154077, 'tokens/total': 4701432, 'epoch': '0.5913'}

 59%|███████████████████████████████████████████████████████████████████████████████████████▏                                                           | 156/263 [1:59:20<1:30:31, 50.76s/it]
                                                                                                                                                                                              
{'loss': '0.6361', 'grad_norm': '0.1782', 'learning_rate': '8.613e-05', 'ppl': '1.889', 'memory/max_active (GiB)': '42.74', 'memory/max_allocated (GiB)': '42.74', 'memory/device_reserved (GiB)': '59.9', 'tokens/train_per_sec_per_gpu': '296.8', 'tokens/trainable': 2171008, 'tokens/total': 4736142, 'epoch': '0.5951'}

 60%|███████████████████████████████████████████████████████████████████████████████████████▊                                                           | 157/263 [2:00:09<1:28:43, 50.22s/it]
                                                                                                                                                                                              
{'loss': '0.7478', 'grad_norm': '0.185', 'learning_rate': '8.481e-05', 'ppl': '2.112', 'memory/max_active (GiB)': '39.56', 'memory/max_allocated (GiB)': '39.56', 'memory/device_reserved (GiB)': '54.81', 'tokens/train_per_sec_per_gpu': '291.1', 'tokens/trainable': 2185260, 'tokens/total': 4769674, 'epoch': '0.599'}

 60%|████████████████████████████████████████████████████████████████████████████████████████▎                                                          | 158/263 [2:01:16<1:36:55, 55.38s/it]
                                                                                                                                                                                              
{'loss': '0.6923', 'grad_norm': '0.1927', 'learning_rate': '8.351e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '51.52', 'memory/max_allocated (GiB)': '51.52', 'memory/device_reserved (GiB)': '73.17', 'tokens/train_per_sec_per_gpu': '299.4', 'tokens/trainable': 2205449, 'tokens/total': 4813544, 'epoch': '0.6028'}

 60%|████████████████████████████████████████████████████████████████████████████████████████▊                                                          | 159/263 [2:02:19<1:39:59, 57.69s/it]
                                                                                                                                                                                              
{'loss': '0.6566', 'grad_norm': '0.1942', 'learning_rate': '8.22e-05', 'ppl': '1.928', 'memory/max_active (GiB)': '45.07', 'memory/max_allocated (GiB)': '45.07', 'memory/device_reserved (GiB)': '63.03', 'tokens/train_per_sec_per_gpu': '265.1', 'tokens/trainable': 2222171, 'tokens/total': 4848392, 'epoch': '0.6066'}

 61%|█████████████████████████████████████████████████████████████████████████████████████████▍                                                         | 160/263 [2:03:11<1:36:19, 56.11s/it]
                                                                                                                                                                                              
{'loss': '0.6622', 'grad_norm': '0.1977', 'learning_rate': '8.09e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '46.42', 'memory/max_allocated (GiB)': '46.42', 'memory/device_reserved (GiB)': '65.06', 'tokens/train_per_sec_per_gpu': '337.5', 'tokens/trainable': 2239867, 'tokens/total': 4886222, 'epoch': '0.6104'}

 61%|█████████████████████████████████████████████████████████████████████████████████████████▉                                                         | 161/263 [2:03:47<1:24:41, 49.82s/it]
                                                                                                                                                                                              
{'loss': '0.6882', 'grad_norm': '0.1678', 'learning_rate': '7.96e-05', 'ppl': '1.99', 'memory/max_active (GiB)': '25.15', 'memory/max_allocated (GiB)': '25.15', 'memory/device_reserved (GiB)': '32.47', 'tokens/train_per_sec_per_gpu': '420.2', 'tokens/trainable': 2254631, 'tokens/total': 4914572, 'epoch': '0.6142'}

 62%|██████████████████████████████████████████████████████████████████████████████████████████▌                                                        | 162/263 [2:04:31<1:21:04, 48.17s/it]
                                                                                                                                                                                              
{'loss': '0.6634', 'grad_norm': '0.254', 'learning_rate': '7.83e-05', 'ppl': '1.941', 'memory/max_active (GiB)': '25.88', 'memory/max_allocated (GiB)': '25.88', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '195.9', 'tokens/trainable': 2263311, 'tokens/total': 4934728, 'epoch': '0.618'}

 62%|███████████████████████████████████████████████████████████████████████████████████████████                                                        | 163/263 [2:05:28<1:24:55, 50.96s/it]
                                                                                                                                                                                              
{'loss': '0.6232', 'grad_norm': '0.1764', 'learning_rate': '7.701e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '33.21', 'memory/max_allocated (GiB)': '33.21', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '270', 'tokens/trainable': 2278829, 'tokens/total': 4967958, 'epoch': '0.6218'}

 62%|███████████████████████████████████████████████████████████████████████████████████████████▋                                                       | 164/263 [2:06:01<1:14:46, 45.32s/it]
                                                                                                                                                                                              
{'loss': '0.7008', 'grad_norm': '0.1909', 'learning_rate': '7.572e-05', 'ppl': '2.015', 'memory/max_active (GiB)': '27.44', 'memory/max_allocated (GiB)': '27.44', 'memory/device_reserved (GiB)': '35.74', 'tokens/train_per_sec_per_gpu': '401.3', 'tokens/trainable': 2291731, 'tokens/total': 4994350, 'epoch': '0.6257'}

 63%|████████████████████████████████████████████████████████████████████████████████████████████▏                                                      | 165/263 [2:06:33<1:07:45, 41.49s/it]
                                                                                                                                                                                              
{'loss': '0.6839', 'grad_norm': '0.213', 'learning_rate': '7.444e-05', 'ppl': '1.982', 'memory/max_active (GiB)': '31.91', 'memory/max_allocated (GiB)': '31.91', 'memory/device_reserved (GiB)': '42.43', 'tokens/train_per_sec_per_gpu': '404.2', 'tokens/trainable': 2304893, 'tokens/total': 5018514, 'epoch': '0.6295'}

 63%|████████████████████████████████████████████████████████████████████████████████████████████▊                                                      | 166/263 [2:07:29<1:13:54, 45.72s/it]
                                                                                                                                                                                              
{'loss': '0.6054', 'grad_norm': '0.1739', 'learning_rate': '7.316e-05', 'ppl': '1.832', 'memory/max_active (GiB)': '35.17', 'memory/max_allocated (GiB)': '35.17', 'memory/device_reserved (GiB)': '44.2', 'tokens/train_per_sec_per_gpu': '285.8', 'tokens/trainable': 2320783, 'tokens/total': 5050822, 'epoch': '0.6333'}

 63%|█████████████████████████████████████████████████████████████████████████████████████████████▎                                                     | 167/263 [2:08:27<1:19:11, 49.50s/it]
                                                                                                                                                                                              
{'loss': '0.6887', 'grad_norm': '0.1949', 'learning_rate': '7.188e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '40.38', 'memory/max_allocated (GiB)': '40.38', 'memory/device_reserved (GiB)': '56.06', 'tokens/train_per_sec_per_gpu': '304.9', 'tokens/trainable': 2338559, 'tokens/total': 5082648, 'epoch': '0.6371'}

 64%|█████████████████████████████████████████████████████████████████████████████████████████████▉                                                     | 168/263 [2:09:09<1:14:44, 47.20s/it]
                                                                                                                                                                                              
{'loss': '0.6576', 'grad_norm': '0.246', 'learning_rate': '7.061e-05', 'ppl': '1.93', 'memory/max_active (GiB)': '22.77', 'memory/max_allocated (GiB)': '22.77', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '217.1', 'tokens/trainable': 2347641, 'tokens/total': 5103024, 'epoch': '0.6409'}

 64%|██████████████████████████████████████████████████████████████████████████████████████████████▍                                                    | 169/263 [2:09:58<1:14:40, 47.66s/it]
                                                                                                                                                                                              
{'loss': '0.5939', 'grad_norm': '0.1881', 'learning_rate': '6.935e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.16', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 2362246, 'tokens/total': 5135012, 'epoch': '0.6447'}

 65%|███████████████████████████████████████████████████████████████████████████████████████████████                                                    | 170/263 [2:10:42<1:12:29, 46.77s/it]
                                                                                                                                                                                              
{'loss': '0.6675', 'grad_norm': '0.1941', 'learning_rate': '6.809e-05', 'ppl': '1.949', 'memory/max_active (GiB)': '40.07', 'memory/max_allocated (GiB)': '40.07', 'memory/device_reserved (GiB)': '55.49', 'tokens/train_per_sec_per_gpu': '330.2', 'tokens/trainable': 2376998, 'tokens/total': 5165914, 'epoch': '0.6485'}

 65%|███████████████████████████████████████████████████████████████████████████████████████████████▌                                                   | 171/263 [2:11:22<1:08:18, 44.54s/it]
                                                                                                                                                                                              
{'loss': '0.662', 'grad_norm': '0.2005', 'learning_rate': '6.684e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '26.99', 'memory/max_allocated (GiB)': '26.99', 'memory/device_reserved (GiB)': '35.18', 'tokens/train_per_sec_per_gpu': '334.3', 'tokens/trainable': 2390156, 'tokens/total': 5192072, 'epoch': '0.6524'}

 65%|████████████████████████████████████████████████████████████████████████████████████████████████▏                                                  | 172/263 [2:12:13<1:10:49, 46.70s/it]
                                                                                                                                                                                              
{'loss': '0.7026', 'grad_norm': '0.2259', 'learning_rate': '6.559e-05', 'ppl': '2.019', 'memory/max_active (GiB)': '47.05', 'memory/max_allocated (GiB)': '47.05', 'memory/device_reserved (GiB)': '66.2', 'tokens/train_per_sec_per_gpu': '264.1', 'tokens/trainable': 2403814, 'tokens/total': 5223520, 'epoch': '0.6562'}

 66%|████████████████████████████████████████████████████████████████████████████████████████████████▋                                                  | 173/263 [2:13:04<1:11:49, 47.89s/it]
                                                                                                                                                                                              
{'loss': '0.6707', 'grad_norm': '0.197', 'learning_rate': '6.435e-05', 'ppl': '1.956', 'memory/max_active (GiB)': '23.56', 'memory/max_allocated (GiB)': '23.56', 'memory/device_reserved (GiB)': '30.05', 'tokens/train_per_sec_per_gpu': '217.8', 'tokens/trainable': 2414847, 'tokens/total': 5247654, 'epoch': '0.66'}

 66%|█████████████████████████████████████████████████████████████████████████████████████████████████▎                                                 | 174/263 [2:13:57<1:13:05, 49.28s/it]
                                                                                                                                                                                              
{'loss': '0.5625', 'grad_norm': '0.1798', 'learning_rate': '6.311e-05', 'ppl': '1.755', 'memory/max_active (GiB)': '30.71', 'memory/max_allocated (GiB)': '30.71', 'memory/device_reserved (GiB)': '40.59', 'tokens/train_per_sec_per_gpu': '295.5', 'tokens/trainable': 2430368, 'tokens/total': 5276410, 'epoch': '0.6638'}

 67%|█████████████████████████████████████████████████████████████████████████████████████████████████▊                                                 | 175/263 [2:14:44<1:11:22, 48.66s/it]
                                                                                                                                                                                              
{'loss': '0.6375', 'grad_norm': '0.1961', 'learning_rate': '6.188e-05', 'ppl': '1.892', 'memory/max_active (GiB)': '28.14', 'memory/max_allocated (GiB)': '28.14', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '234.5', 'tokens/trainable': 2441441, 'tokens/total': 5302790, 'epoch': '0.6676'}

 67%|██████████████████████████████████████████████████████████████████████████████████████████████████▎                                                | 176/263 [2:15:37<1:12:44, 50.17s/it]
                                                                                                                                                                                              
{'loss': '0.6539', 'grad_norm': '0.1857', 'learning_rate': '6.066e-05', 'ppl': '1.923', 'memory/max_active (GiB)': '44.34', 'memory/max_allocated (GiB)': '44.34', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '260.5', 'tokens/trainable': 2455423, 'tokens/total': 5333674, 'epoch': '0.6714'}

 67%|██████████████████████████████████████████████████████████████████████████████████████████████████▉                                                | 177/263 [2:16:24<1:10:15, 49.01s/it]
                                                                                                                                                                                              
{'loss': '0.5762', 'grad_norm': '0.1929', 'learning_rate': '5.945e-05', 'ppl': '1.779', 'memory/max_active (GiB)': '28.62', 'memory/max_allocated (GiB)': '28.62', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '251.4', 'tokens/trainable': 2467068, 'tokens/total': 5359564, 'epoch': '0.6753'}

 68%|███████████████████████████████████████████████████████████████████████████████████████████████████▍                                               | 178/263 [2:17:02<1:05:00, 45.89s/it]
                                                                                                                                                                                              
{'loss': '0.6303', 'grad_norm': '0.1936', 'learning_rate': '5.824e-05', 'ppl': '1.878', 'memory/max_active (GiB)': '33.46', 'memory/max_allocated (GiB)': '33.46', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '384.4', 'tokens/trainable': 2481907, 'tokens/total': 5388678, 'epoch': '0.6791'}

 68%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍                                               | 179/263 [2:17:37<59:27, 42.47s/it]
                                                                                                                                                                                              
{'loss': '0.6012', 'grad_norm': '0.203', 'learning_rate': '5.704e-05', 'ppl': '1.824', 'memory/max_active (GiB)': '25.76', 'memory/max_allocated (GiB)': '25.76', 'memory/device_reserved (GiB)': '33.21', 'tokens/train_per_sec_per_gpu': '338.2', 'tokens/trainable': 2493569, 'tokens/total': 5414394, 'epoch': '0.6829'}

 68%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉                                               | 180/263 [2:18:10<54:55, 39.70s/it]
                                                                                                                                                                                              
{'loss': '0.6847', 'grad_norm': '0.1948', 'learning_rate': '5.585e-05', 'ppl': '1.983', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '404.4', 'tokens/trainable': 2507016, 'tokens/total': 5440266, 'epoch': '0.6867'}

 69%|██████████████████████████████████████████████████████████████████████████████████████████████████████▌                                              | 181/263 [2:18:44<51:47, 37.90s/it]
                                                                                                                                                                                              
{'loss': '0.6923', 'grad_norm': '0.2085', 'learning_rate': '5.466e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '35.58', 'memory/max_allocated (GiB)': '35.58', 'memory/device_reserved (GiB)': '44.79', 'tokens/train_per_sec_per_gpu': '367.1', 'tokens/trainable': 2519381, 'tokens/total': 5467466, 'epoch': '0.6905'}

 69%|███████████████████████████████████████████████████████████████████████████████████████████████████████                                              | 182/263 [2:19:19<50:05, 37.10s/it]
                                                                                                                                                                                              
{'loss': '0.7096', 'grad_norm': '0.196', 'learning_rate': '5.348e-05', 'ppl': '2.033', 'memory/max_active (GiB)': '30.89', 'memory/max_allocated (GiB)': '30.89', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '379.8', 'tokens/trainable': 2532772, 'tokens/total': 5497208, 'epoch': '0.6943'}

 70%|███████████████████████████████████████████████████████████████████████████████████████████████████████▋                                             | 183/263 [2:19:53<48:13, 36.17s/it]
                                                                                                                                                                                              
{'loss': '0.676', 'grad_norm': '0.2051', 'learning_rate': '5.231e-05', 'ppl': '1.966', 'memory/max_active (GiB)': '24.11', 'memory/max_allocated (GiB)': '24.11', 'memory/device_reserved (GiB)': '38.66', 'tokens/train_per_sec_per_gpu': '345.2', 'tokens/trainable': 2544502, 'tokens/total': 5521580, 'epoch': '0.6981'}

 70%|████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                            | 184/263 [2:20:43<52:59, 40.25s/it]
                                                                                                                                                                                              
{'loss': '0.6542', 'grad_norm': '0.2071', 'learning_rate': '5.115e-05', 'ppl': '1.924', 'memory/max_active (GiB)': '39.48', 'memory/max_allocated (GiB)': '39.48', 'memory/device_reserved (GiB)': '54.69', 'tokens/train_per_sec_per_gpu': '310.1', 'tokens/trainable': 2559938, 'tokens/total': 5553304, 'epoch': '0.702'}

 70%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                            | 185/263 [2:21:23<52:24, 40.31s/it]
                                                                                                                                                                                              
{'loss': '0.6775', 'grad_norm': '0.2571', 'learning_rate': '5e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '38.61', 'memory/max_allocated (GiB)': '38.61', 'memory/device_reserved (GiB)': '53.19', 'tokens/train_per_sec_per_gpu': '308.9', 'tokens/trainable': 2572432, 'tokens/total': 5582308, 'epoch': '0.7058'}

 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                           | 186/263 [2:22:13<55:23, 43.16s/it]
                                                                                                                                                                                              
{'loss': '0.6412', 'grad_norm': '0.2141', 'learning_rate': '4.886e-05', 'ppl': '1.899', 'memory/max_active (GiB)': '50.63', 'memory/max_allocated (GiB)': '50.63', 'memory/device_reserved (GiB)': '71.62', 'tokens/train_per_sec_per_gpu': '345.7', 'tokens/trainable': 2589650, 'tokens/total': 5622242, 'epoch': '0.7096'}

 71%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                           | 187/263 [2:22:59<55:44, 44.01s/it]
                                                                                                                                                                                              
{'loss': '0.6154', 'grad_norm': '0.1996', 'learning_rate': '4.772e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '33.16', 'memory/max_allocated (GiB)': '33.16', 'memory/device_reserved (GiB)': '70.77', 'tokens/train_per_sec_per_gpu': '289.1', 'tokens/trainable': 2602951, 'tokens/total': 5652818, 'epoch': '0.7134'}

 71%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                          | 188/263 [2:23:52<58:33, 46.84s/it]
                                                                                                                                                                                              
{'loss': '0.6231', 'grad_norm': '0.2165', 'learning_rate': '4.66e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '55.1', 'memory/max_allocated (GiB)': '55.1', 'memory/device_reserved (GiB)': '78.44', 'tokens/train_per_sec_per_gpu': '241.5', 'tokens/trainable': 2615857, 'tokens/total': 5682190, 'epoch': '0.7172'}

 72%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                         | 189/263 [2:24:51<1:02:07, 50.37s/it]
                                                                                                                                                                                              
{'loss': '0.644', 'grad_norm': '0.1827', 'learning_rate': '4.548e-05', 'ppl': '1.904', 'memory/max_active (GiB)': '40.42', 'memory/max_allocated (GiB)': '40.42', 'memory/device_reserved (GiB)': '56.19', 'tokens/train_per_sec_per_gpu': '308.8', 'tokens/trainable': 2633955, 'tokens/total': 5718572, 'epoch': '0.721'}

 72%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                         | 190/263 [2:25:38<59:58, 49.29s/it]
                                                                                                                                                                                              
{'loss': '0.6435', 'grad_norm': '0.2112', 'learning_rate': '4.437e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '39.61', 'memory/max_allocated (GiB)': '39.61', 'memory/device_reserved (GiB)': '54.75', 'tokens/train_per_sec_per_gpu': '314.9', 'tokens/trainable': 2648683, 'tokens/total': 5751500, 'epoch': '0.7248'}

 73%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                        | 191/263 [2:26:29<59:50, 49.87s/it]
                                                                                                                                                                                              
{'loss': '0.6434', 'grad_norm': '0.1974', 'learning_rate': '4.328e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '48.38', 'memory/max_allocated (GiB)': '48.38', 'memory/device_reserved (GiB)': '68.25', 'tokens/train_per_sec_per_gpu': '370.3', 'tokens/trainable': 2667651, 'tokens/total': 5795452, 'epoch': '0.7287'}

 73%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                       | 192/263 [2:27:33<1:04:07, 54.19s/it]
                                                                                                                                                                                              
{'loss': '0.7199', 'grad_norm': '0.1943', 'learning_rate': '4.219e-05', 'ppl': '2.054', 'memory/max_active (GiB)': '51.55', 'memory/max_allocated (GiB)': '51.55', 'memory/device_reserved (GiB)': '72.91', 'tokens/train_per_sec_per_gpu': '241.4', 'tokens/trainable': 2683164, 'tokens/total': 5831970, 'epoch': '0.7325'}

 73%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                       | 193/263 [2:28:31<1:04:25, 55.22s/it]
                                                                                                                                                                                              
{'loss': '0.6151', 'grad_norm': '0.212', 'learning_rate': '4.111e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '45.42', 'memory/max_allocated (GiB)': '45.42', 'memory/device_reserved (GiB)': '63.73', 'tokens/train_per_sec_per_gpu': '237.2', 'tokens/trainable': 2696830, 'tokens/total': 5864046, 'epoch': '0.7363'}

 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                       | 194/263 [2:29:15<59:32, 51.78s/it]
                                                                                                                                                                                              
{'loss': '0.662', 'grad_norm': '0.1948', 'learning_rate': '4.005e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '36.93', 'memory/max_allocated (GiB)': '36.93', 'memory/device_reserved (GiB)': '46.86', 'tokens/train_per_sec_per_gpu': '370.9', 'tokens/trainable': 2713056, 'tokens/total': 5897816, 'epoch': '0.7401'}

 74%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                      | 195/263 [2:29:58<55:48, 49.24s/it]
                                                                                                                                                                                              
{'loss': '0.7405', 'grad_norm': '0.2793', 'learning_rate': '3.899e-05', 'ppl': '2.097', 'memory/max_active (GiB)': '25.49', 'memory/max_allocated (GiB)': '25.49', 'memory/device_reserved (GiB)': '32.93', 'tokens/train_per_sec_per_gpu': '231.7', 'tokens/trainable': 2723094, 'tokens/total': 5917064, 'epoch': '0.7439'}

 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████                                      | 196/263 [2:30:43<53:30, 47.92s/it]
                                                                                                                                                                                              
{'loss': '0.6287', 'grad_norm': '0.2298', 'learning_rate': '3.795e-05', 'ppl': '1.875', 'memory/max_active (GiB)': '21.3', 'memory/max_allocated (GiB)': '21.3', 'memory/device_reserved (GiB)': '26.55', 'tokens/train_per_sec_per_gpu': '184.1', 'tokens/trainable': 2731350, 'tokens/total': 5936682, 'epoch': '0.7477'}

 75%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 197/263 [2:31:37<54:37, 49.66s/it]
                                                                                                                                                                                              
{'loss': '0.6935', 'grad_norm': '0.2097', 'learning_rate': '3.691e-05', 'ppl': '2.001', 'memory/max_active (GiB)': '30.31', 'memory/max_allocated (GiB)': '30.31', 'memory/device_reserved (GiB)': '40.12', 'tokens/train_per_sec_per_gpu': '254.8', 'tokens/trainable': 2745037, 'tokens/total': 5968342, 'epoch': '0.7515'}

 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                    | 198/263 [2:32:29<54:33, 50.36s/it]
                                                                                                                                                                                              
{'loss': '0.6516', 'grad_norm': '0.2135', 'learning_rate': '3.589e-05', 'ppl': '1.919', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.11', 'tokens/train_per_sec_per_gpu': '247.8', 'tokens/trainable': 2757923, 'tokens/total': 5997292, 'epoch': '0.7554'}

 76%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                    | 199/263 [2:33:29<56:57, 53.40s/it]
                                                                                                                                                                                              
{'loss': '0.6918', 'grad_norm': '0.1828', 'learning_rate': '3.488e-05', 'ppl': '1.997', 'memory/max_active (GiB)': '56.91', 'memory/max_allocated (GiB)': '56.91', 'memory/device_reserved (GiB)': '73.99', 'tokens/train_per_sec_per_gpu': '302.4', 'tokens/trainable': 2776213, 'tokens/total': 6038550, 'epoch': '0.7592'}

 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                   | 200/263 [2:34:22<56:02, 53.37s/it]
                                                                                                                                                                                              
{'loss': '0.6774', 'grad_norm': '0.2068', 'learning_rate': '3.388e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '43.54', 'memory/max_allocated (GiB)': '43.54', 'memory/device_reserved (GiB)': '60.76', 'tokens/train_per_sec_per_gpu': '271.4', 'tokens/trainable': 2790680, 'tokens/total': 6069138, 'epoch': '0.763'}

 76%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                   | 201/263 [2:35:22<57:08, 55.30s/it]
                                                                                                                                                                                              
{'loss': '0.6853', 'grad_norm': '0.2284', 'learning_rate': '3.289e-05', 'ppl': '1.984', 'memory/max_active (GiB)': '55.21', 'memory/max_allocated (GiB)': '55.21', 'memory/device_reserved (GiB)': '78.48', 'tokens/train_per_sec_per_gpu': '256.8', 'tokens/trainable': 2806039, 'tokens/total': 6104430, 'epoch': '0.7668'}

 77%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                  | 202/263 [2:35:58<50:17, 49.47s/it]
                                                                                                                                                                                              
{'loss': '0.7109', 'grad_norm': '0.2175', 'learning_rate': '3.191e-05', 'ppl': '2.036', 'memory/max_active (GiB)': '23.78', 'memory/max_allocated (GiB)': '23.78', 'memory/device_reserved (GiB)': '30.28', 'tokens/train_per_sec_per_gpu': '336.1', 'tokens/trainable': 2818090, 'tokens/total': 6127762, 'epoch': '0.7706'}

 77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                  | 203/263 [2:36:28<43:45, 43.75s/it]
                                                                                                                                                                                              
{'loss': '0.6929', 'grad_norm': '0.2504', 'learning_rate': '3.095e-05', 'ppl': '2', 'memory/max_active (GiB)': '27.24', 'memory/max_allocated (GiB)': '27.24', 'memory/device_reserved (GiB)': '30.67', 'tokens/train_per_sec_per_gpu': '347.7', 'tokens/trainable': 2828664, 'tokens/total': 6151536, 'epoch': '0.7744'}

 78%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                 | 204/263 [2:37:16<44:12, 44.96s/it]
                                                                                                                                                                                              
{'loss': '0.6381', 'grad_norm': '0.1846', 'learning_rate': '3e-05', 'ppl': '1.893', 'memory/max_active (GiB)': '35.73', 'memory/max_allocated (GiB)': '35.73', 'memory/device_reserved (GiB)': '45.1', 'tokens/train_per_sec_per_gpu': '335.6', 'tokens/trainable': 2844697, 'tokens/total': 6186540, 'epoch': '0.7783'}

 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                | 205/263 [2:38:09<45:46, 47.35s/it]
                                                                                                                                                                                              
{'loss': '0.578', 'grad_norm': '0.1669', 'learning_rate': '2.906e-05', 'ppl': '1.783', 'memory/max_active (GiB)': '38.6', 'memory/max_allocated (GiB)': '38.6', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '300.8', 'tokens/trainable': 2860615, 'tokens/total': 6225342, 'epoch': '0.7821'}

 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                | 206/263 [2:38:55<44:36, 46.95s/it]
                                                                                                                                                                                              
{'loss': '0.7123', 'grad_norm': '0.2747', 'learning_rate': '2.813e-05', 'ppl': '2.039', 'memory/max_active (GiB)': '33.82', 'memory/max_allocated (GiB)': '33.82', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '214.9', 'tokens/trainable': 2870504, 'tokens/total': 6249862, 'epoch': '0.7859'}

 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                               | 207/263 [2:39:38<42:35, 45.64s/it]
                                                                                                                                                                                              
{'loss': '0.5941', 'grad_norm': '0.198', 'learning_rate': '2.721e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '42.35', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 2883356, 'tokens/total': 6278118, 'epoch': '0.7897'}

 79%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                               | 208/263 [2:40:17<40:11, 43.85s/it]
                                                                                                                                                                                              
{'loss': '0.7291', 'grad_norm': '0.1958', 'learning_rate': '2.631e-05', 'ppl': '2.073', 'memory/max_active (GiB)': '29.79', 'memory/max_allocated (GiB)': '29.79', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '300.4', 'tokens/trainable': 2895280, 'tokens/total': 6304026, 'epoch': '0.7935'}

 79%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                              | 209/263 [2:40:56<38:06, 42.35s/it]
                                                                                                                                                                                              
{'loss': '0.6712', 'grad_norm': '0.209', 'learning_rate': '2.542e-05', 'ppl': '1.957', 'memory/max_active (GiB)': '28.91', 'memory/max_allocated (GiB)': '28.91', 'memory/device_reserved (GiB)': '37.99', 'tokens/train_per_sec_per_gpu': '305.4', 'tokens/trainable': 2907146, 'tokens/total': 6329922, 'epoch': '0.7973'}

 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                              | 210/263 [2:41:38<37:14, 42.17s/it]
                                                                                                                                                                                              
{'loss': '0.6474', 'grad_norm': '0.2265', 'learning_rate': '2.454e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '33.6', 'memory/max_allocated (GiB)': '33.6', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '257.8', 'tokens/trainable': 2917904, 'tokens/total': 6354510, 'epoch': '0.8011'}

 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                             | 211/263 [2:42:15<35:18, 40.74s/it]
                                                                                                                                                                                              
{'loss': '0.6219', 'grad_norm': '0.1866', 'learning_rate': '2.368e-05', 'ppl': '1.863', 'memory/max_active (GiB)': '23.98', 'memory/max_allocated (GiB)': '23.98', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 2929254, 'tokens/total': 6379050, 'epoch': '0.805'}

 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                             | 212/263 [2:43:06<37:06, 43.66s/it]
{'loss': '0.5683', 'grad_norm': '0.1713', 'learning_rate': '2.283e-05', 'ppl': '1.765', 'memory/max_active (GiB)': '36.5', 'memory/max_allocated (GiB)': '36.5', 'memory/device_reserved (GiB)': '46.02', 'tokens/train_per_sec_per_gpu': '342.5', 'tokens/trainable': 2946538, 'tokens/total': 6416330, 'epoch': '0.8088'}

 81%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                            | 213/263 [2:44:05<40:19, 48.39s/it]
{'loss': '0.6341', 'grad_norm': '0.1864', 'learning_rate': '2.199e-05', 'ppl': '1.885', 'memory/max_active (GiB)': '52.19', 'memory/max_allocated (GiB)': '52.19', 'memory/device_reserved (GiB)': '74.01', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 2962875, 'tokens/total': 6452688, 'epoch': '0.8126'}

 81%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                           | 214/263 [2:44:59<40:41, 49.83s/it]
{'loss': '0.6799', 'grad_norm': '0.2418', 'learning_rate': '2.117e-05', 'ppl': '1.974', 'memory/max_active (GiB)': '33.52', 'memory/max_allocated (GiB)': '33.52', 'memory/device_reserved (GiB)': '42', 'tokens/train_per_sec_per_gpu': '259.6', 'tokens/trainable': 2976675, 'tokens/total': 6485638, 'epoch': '0.8164'}

 82%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                           | 215/263 [2:45:35<36:33, 45.70s/it]
{'loss': '0.5868', 'grad_norm': '0.2032', 'learning_rate': '2.036e-05', 'ppl': '1.798', 'memory/max_active (GiB)': '24.76', 'memory/max_allocated (GiB)': '24.76', 'memory/device_reserved (GiB)': '31.8', 'tokens/train_per_sec_per_gpu': '283.4', 'tokens/trainable': 2986893, 'tokens/total': 6509020, 'epoch': '0.8202'}

 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                          | 216/263 [2:46:18<35:20, 45.12s/it]
{'loss': '0.7066', 'grad_norm': '0.2272', 'learning_rate': '1.957e-05', 'ppl': '2.027', 'memory/max_active (GiB)': '34.06', 'memory/max_allocated (GiB)': '34.06', 'memory/device_reserved (GiB)': '42.78', 'tokens/train_per_sec_per_gpu': '281.2', 'tokens/trainable': 2999202, 'tokens/total': 6537048, 'epoch': '0.824'}

 83%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                          | 217/263 [2:46:58<33:17, 43.41s/it]
{'loss': '0.6148', 'grad_norm': '0.1853', 'learning_rate': '1.879e-05', 'ppl': '1.849', 'memory/max_active (GiB)': '27.75', 'memory/max_allocated (GiB)': '27.75', 'memory/device_reserved (GiB)': '36.31', 'tokens/train_per_sec_per_gpu': '309.6', 'tokens/trainable': 3011411, 'tokens/total': 6563400, 'epoch': '0.8278'}

 83%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                         | 218/263 [2:47:46<33:43, 44.97s/it]
{'loss': '0.6522', 'grad_norm': '0.1957', 'learning_rate': '1.802e-05', 'ppl': '1.92', 'memory/max_active (GiB)': '31.19', 'memory/max_allocated (GiB)': '31.19', 'memory/device_reserved (GiB)': '41.37', 'tokens/train_per_sec_per_gpu': '279.6', 'tokens/trainable': 3024998, 'tokens/total': 6593488, 'epoch': '0.8317'}

 83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                         | 219/263 [2:48:33<33:24, 45.57s/it]
{'loss': '0.6208', 'grad_norm': '0.1843', 'learning_rate': '1.727e-05', 'ppl': '1.86', 'memory/max_active (GiB)': '30.24', 'memory/max_allocated (GiB)': '30.24', 'memory/device_reserved (GiB)': '39.91', 'tokens/train_per_sec_per_gpu': '332.7', 'tokens/trainable': 3040622, 'tokens/total': 6624140, 'epoch': '0.8355'}

 84%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                        | 220/263 [2:49:22<33:25, 46.63s/it]
{'loss': '0.6097', 'grad_norm': '0.1874', 'learning_rate': '1.653e-05', 'ppl': '1.84', 'memory/max_active (GiB)': '37.68', 'memory/max_allocated (GiB)': '37.68', 'memory/device_reserved (GiB)': '47.76', 'tokens/train_per_sec_per_gpu': '277.7', 'tokens/trainable': 3054263, 'tokens/total': 6654518, 'epoch': '0.8393'}

 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                       | 221/263 [2:50:14<33:34, 47.97s/it]
{'loss': '0.6886', 'grad_norm': '0.2007', 'learning_rate': '1.581e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '29.37', 'memory/max_allocated (GiB)': '29.37', 'memory/device_reserved (GiB)': '38.64', 'tokens/train_per_sec_per_gpu': '249.6', 'tokens/trainable': 3067017, 'tokens/total': 6683614, 'epoch': '0.8431'}

 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                       | 222/263 [2:51:01<32:41, 47.83s/it]
{'loss': '0.629', 'grad_norm': '0.2061', 'learning_rate': '1.51e-05', 'ppl': '1.876', 'memory/max_active (GiB)': '47.52', 'memory/max_allocated (GiB)': '47.52', 'memory/device_reserved (GiB)': '66.84', 'tokens/train_per_sec_per_gpu': '305', 'tokens/trainable': 3081510, 'tokens/total': 6718874, 'epoch': '0.8469'}

 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                      | 223/263 [2:51:45<31:07, 46.68s/it]
{'loss': '0.6047', 'grad_norm': '0.1965', 'learning_rate': '1.441e-05', 'ppl': '1.831', 'memory/max_active (GiB)': '37.79', 'memory/max_allocated (GiB)': '37.79', 'memory/device_reserved (GiB)': '47.88', 'tokens/train_per_sec_per_gpu': '297.3', 'tokens/trainable': 3094585, 'tokens/total': 6747542, 'epoch': '0.8507'}

 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                      | 224/263 [2:52:53<34:29, 53.05s/it]
{'loss': '0.6172', 'grad_norm': '0.1745', 'learning_rate': '1.373e-05', 'ppl': '1.854', 'memory/max_active (GiB)': '48.12', 'memory/max_allocated (GiB)': '48.12', 'memory/device_reserved (GiB)': '67.76', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 3114552, 'tokens/total': 6793276, 'epoch': '0.8546'}

 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                     | 225/263 [2:53:48<33:52, 53.49s/it]
{'loss': '0.6193', 'grad_norm': '0.2026', 'learning_rate': '1.307e-05', 'ppl': '1.858', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.99', 'tokens/train_per_sec_per_gpu': '257.3', 'tokens/trainable': 3128578, 'tokens/total': 6823484, 'epoch': '0.8584'}

 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                     | 226/263 [2:54:34<31:37, 51.28s/it]
{'loss': '0.6827', 'grad_norm': '0.2373', 'learning_rate': '1.242e-05', 'ppl': '1.979', 'memory/max_active (GiB)': '39.13', 'memory/max_allocated (GiB)': '39.13', 'memory/device_reserved (GiB)': '54.05', 'tokens/train_per_sec_per_gpu': '285.3', 'tokens/trainable': 3141735, 'tokens/total': 6855924, 'epoch': '0.8622'}

 86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                    | 227/263 [2:55:15<29:03, 48.44s/it]
{'loss': '0.6604', 'grad_norm': '0.1903', 'learning_rate': '1.179e-05', 'ppl': '1.935', 'memory/max_active (GiB)': '30.3', 'memory/max_allocated (GiB)': '30.3', 'memory/device_reserved (GiB)': '40.04', 'tokens/train_per_sec_per_gpu': '315.4', 'tokens/trainable': 3154923, 'tokens/total': 6884738, 'epoch': '0.866'}

 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                   | 228/263 [2:55:58<27:15, 46.74s/it]
{'loss': '0.6514', 'grad_norm': '0.2107', 'learning_rate': '1.117e-05', 'ppl': '1.918', 'memory/max_active (GiB)': '22.9', 'memory/max_allocated (GiB)': '22.9', 'memory/device_reserved (GiB)': '34.14', 'tokens/train_per_sec_per_gpu': '252.7', 'tokens/trainable': 3165734, 'tokens/total': 6907742, 'epoch': '0.8698'}

 87%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                   | 229/263 [2:56:54<27:58, 49.37s/it]
{'loss': '0.6451', 'grad_norm': '0.1984', 'learning_rate': '1.057e-05', 'ppl': '1.906', 'memory/max_active (GiB)': '43.49', 'memory/max_allocated (GiB)': '43.49', 'memory/device_reserved (GiB)': '60.71', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 3180849, 'tokens/total': 6937996, 'epoch': '0.8736'}

 87%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                  | 230/263 [2:57:41<26:45, 48.65s/it]
{'loss': '0.6366', 'grad_norm': '0.1864', 'learning_rate': '9.985e-06', 'ppl': '1.89', 'memory/max_active (GiB)': '35.46', 'memory/max_allocated (GiB)': '35.46', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '330', 'tokens/trainable': 3196343, 'tokens/total': 6970136, 'epoch': '0.8774'}

 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                  | 231/263 [2:58:23<24:51, 46.60s/it]
{'loss': '0.6353', 'grad_norm': '0.1761', 'learning_rate': '9.416e-06', 'ppl': '1.888', 'memory/max_active (GiB)': '34.63', 'memory/max_allocated (GiB)': '34.63', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '386.8', 'tokens/trainable': 3212523, 'tokens/total': 7006106, 'epoch': '0.8813'}

 88%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                 | 232/263 [2:59:15<24:55, 48.25s/it]
{'loss': '0.5426', 'grad_norm': '0.1728', 'learning_rate': '8.862e-06', 'ppl': '1.721', 'memory/max_active (GiB)': '38.02', 'memory/max_allocated (GiB)': '38.02', 'memory/device_reserved (GiB)': '48.33', 'tokens/train_per_sec_per_gpu': '275.8', 'tokens/trainable': 3226889, 'tokens/total': 7039494, 'epoch': '0.8851'}

 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                 | 233/263 [2:59:59<23:35, 47.20s/it]
{'loss': '0.6979', 'grad_norm': '0.2131', 'learning_rate': '8.325e-06', 'ppl': '2.01', 'memory/max_active (GiB)': '35.02', 'memory/max_allocated (GiB)': '35.02', 'memory/device_reserved (GiB)': '44.24', 'tokens/train_per_sec_per_gpu': '322.7', 'tokens/trainable': 3241328, 'tokens/total': 7070424, 'epoch': '0.8889'}

 89%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                | 234/263 [3:00:41<22:00, 45.55s/it]
{'loss': '0.6724', 'grad_norm': '0.2417', 'learning_rate': '7.803e-06', 'ppl': '1.959', 'memory/max_active (GiB)': '28.09', 'memory/max_allocated (GiB)': '28.09', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 3251965, 'tokens/total': 7093680, 'epoch': '0.8927'}

 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏               | 235/263 [3:01:27<21:14, 45.53s/it]
{'loss': '0.6359', 'grad_norm': '0.1907', 'learning_rate': '7.298e-06', 'ppl': '1.889', 'memory/max_active (GiB)': '36.62', 'memory/max_allocated (GiB)': '36.62', 'memory/device_reserved (GiB)': '46.28', 'tokens/train_per_sec_per_gpu': '325.9', 'tokens/trainable': 3266781, 'tokens/total': 7123902, 'epoch': '0.8965'}

 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋               | 236/263 [3:02:15<20:50, 46.32s/it]
{'loss': '0.6866', 'grad_norm': '0.202', 'learning_rate': '6.809e-06', 'ppl': '1.987', 'memory/max_active (GiB)': '38.62', 'memory/max_allocated (GiB)': '38.62', 'memory/device_reserved (GiB)': '53.25', 'tokens/train_per_sec_per_gpu': '298.6', 'tokens/trainable': 3281164, 'tokens/total': 7156358, 'epoch': '0.9003'}

 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎              | 237/263 [3:02:56<19:22, 44.71s/it]
{'loss': '0.6841', 'grad_norm': '0.2104', 'learning_rate': '6.337e-06', 'ppl': '1.982', 'memory/max_active (GiB)': '36.98', 'memory/max_allocated (GiB)': '36.98', 'memory/device_reserved (GiB)': '46.64', 'tokens/train_per_sec_per_gpu': '374.6', 'tokens/trainable': 3296514, 'tokens/total': 7188274, 'epoch': '0.9041'}

 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊              | 238/263 [3:03:46<19:19, 46.39s/it]
{'loss': '0.6678', 'grad_norm': '0.1991', 'learning_rate': '5.881e-06', 'ppl': '1.95', 'memory/max_active (GiB)': '58.32', 'memory/max_allocated (GiB)': '58.32', 'memory/device_reserved (GiB)': '75.86', 'tokens/train_per_sec_per_gpu': '327.5', 'tokens/trainable': 3312986, 'tokens/total': 7224448, 'epoch': '0.908'}

 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍             | 239/263 [3:04:33<18:35, 46.48s/it]
{'loss': '0.6174', 'grad_norm': '0.1702', 'learning_rate': '5.441e-06', 'ppl': '1.854', 'memory/max_active (GiB)': '31.28', 'memory/max_allocated (GiB)': '31.28', 'memory/device_reserved (GiB)': '41.57', 'tokens/train_per_sec_per_gpu': '353.2', 'tokens/trainable': 3329480, 'tokens/total': 7253234, 'epoch': '0.9118'}

 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉             | 240/263 [3:05:26<18:37, 48.60s/it]
{'loss': '0.6532', 'grad_norm': '0.1679', 'learning_rate': '5.018e-06', 'ppl': '1.922', 'memory/max_active (GiB)': '34.53', 'memory/max_allocated (GiB)': '34.53', 'memory/device_reserved (GiB)': '43.5', 'tokens/train_per_sec_per_gpu': '319.1', 'tokens/trainable': 3346561, 'tokens/total': 7292512, 'epoch': '0.9156'}

 92%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌            | 241/263 [3:06:09<17:10, 46.84s/it]
{'loss': '0.6461', 'grad_norm': '0.2172', 'learning_rate': '4.612e-06', 'ppl': '1.908', 'memory/max_active (GiB)': '30.52', 'memory/max_allocated (GiB)': '30.52', 'memory/device_reserved (GiB)': '40.36', 'tokens/train_per_sec_per_gpu': '270.1', 'tokens/trainable': 3358110, 'tokens/total': 7319476, 'epoch': '0.9194'}

 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████            | 242/263 [3:07:13<18:14, 52.12s/it]
{'loss': '0.6017', 'grad_norm': '0.1813', 'learning_rate': '4.222e-06', 'ppl': '1.825', 'memory/max_active (GiB)': '46.95', 'memory/max_allocated (GiB)': '46.95', 'memory/device_reserved (GiB)': '66.02', 'tokens/train_per_sec_per_gpu': '347.9', 'tokens/trainable': 3380518, 'tokens/total': 7369626, 'epoch': '0.9232'}

 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋           | 243/263 [3:08:11<17:54, 53.72s/it]
{'loss': '0.5814', 'grad_norm': '0.1749', 'learning_rate': '3.85e-06', 'ppl': '1.789', 'memory/max_active (GiB)': '39.53', 'memory/max_allocated (GiB)': '39.53', 'memory/device_reserved (GiB)': '58.11', 'tokens/train_per_sec_per_gpu': '306.3', 'tokens/trainable': 3398118, 'tokens/total': 7406506, 'epoch': '0.927'}

 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏          | 244/263 [3:08:58<16:25, 51.87s/it]
{'loss': '0.632', 'grad_norm': '0.1955', 'learning_rate': '3.494e-06', 'ppl': '1.881', 'memory/max_active (GiB)': '49.89', 'memory/max_allocated (GiB)': '49.89', 'memory/device_reserved (GiB)': '70.63', 'tokens/train_per_sec_per_gpu': '358.4', 'tokens/trainable': 3415167, 'tokens/total': 7447132, 'epoch': '0.9309'}

 93%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊          | 245/263 [3:09:42<14:46, 49.25s/it]
{'loss': '0.683', 'grad_norm': '0.2159', 'learning_rate': '3.155e-06', 'ppl': '1.98', 'memory/max_active (GiB)': '31.1', 'memory/max_allocated (GiB)': '31.1', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 3428390, 'tokens/total': 7476600, 'epoch': '0.9347'}

 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎         | 246/263 [3:11:01<16:32, 58.35s/it]
{'loss': '0.6627', 'grad_norm': '0.1617', 'learning_rate': '2.833e-06', 'ppl': '1.94', 'memory/max_active (GiB)': '57.27', 'memory/max_allocated (GiB)': '57.27', 'memory/device_reserved (GiB)': '74.5', 'tokens/train_per_sec_per_gpu': '342.7', 'tokens/trainable': 3455668, 'tokens/total': 7533052, 'epoch': '0.9385'}

 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉         | 247/263 [3:11:31<13:18, 49.90s/it]
{'loss': '0.6176', 'grad_norm': '0.1991', 'learning_rate': '2.528e-06', 'ppl': '1.855', 'memory/max_active (GiB)': '27.2', 'memory/max_allocated (GiB)': '27.2', 'memory/device_reserved (GiB)': '35.37', 'tokens/train_per_sec_per_gpu': '385.7', 'tokens/trainable': 3467302, 'tokens/total': 7557486, 'epoch': '0.9423'}

 94%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌        | 248/263 [3:12:15<11:59, 47.94s/it]
                                                                                                                                                                                              
{'loss': '0.6939', 'grad_norm': '0.2216', 'learning_rate': '2.241e-06', 'ppl': '2.002', 'memory/max_active (GiB)': '44.94', 'memory/max_allocated (GiB)': '44.94', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '307.3', 'tokens/trainable': 3480626, 'tokens/total': 7588958, 'epoch': '0.9461'}

 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████        | 249/263 [3:13:08<11:32, 49.44s/it]
                                                                                                                                                                                              
{'loss': '0.6791', 'grad_norm': '0.2064', 'learning_rate': '1.97e-06', 'ppl': '1.972', 'memory/max_active (GiB)': '39.77', 'memory/max_allocated (GiB)': '39.77', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '323.8', 'tokens/trainable': 3497767, 'tokens/total': 7623894, 'epoch': '0.9499'}

 95%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋       | 250/263 [3:14:15<11:50, 54.69s/it]
                                                                                                                                                                                              
{'loss': '0.6591', 'grad_norm': '0.2037', 'learning_rate': '1.717e-06', 'ppl': '1.933', 'memory/max_active (GiB)': '55.75', 'memory/max_allocated (GiB)': '55.75', 'memory/device_reserved (GiB)': '72.33', 'tokens/train_per_sec_per_gpu': '270.7', 'tokens/trainable': 3515893, 'tokens/total': 7665338, 'epoch': '0.9537'}

 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏      | 251/263 [3:14:50<09:48, 49.05s/it]
                                                                                                                                                                                              
{'loss': '0.6294', 'grad_norm': '0.2187', 'learning_rate': '1.481e-06', 'ppl': '1.876', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '32.43', 'tokens/train_per_sec_per_gpu': '270.5', 'tokens/trainable': 3525599, 'tokens/total': 7687250, 'epoch': '0.9576'}

 96%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊      | 252/263 [3:15:41<09:04, 49.48s/it]
                                                                                                                                                                                              
{'loss': '0.545', 'grad_norm': '0.1694', 'learning_rate': '1.262e-06', 'ppl': '1.725', 'memory/max_active (GiB)': '48.1', 'memory/max_allocated (GiB)': '48.1', 'memory/device_reserved (GiB)': '67.93', 'tokens/train_per_sec_per_gpu': '384.7', 'tokens/trainable': 3545027, 'tokens/total': 7722636, 'epoch': '0.9614'}

 96%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎     | 253/263 [3:16:51<09:16, 55.62s/it]
                                                                                                                                                                                              
{'loss': '0.5997', 'grad_norm': '0.1826', 'learning_rate': '1.061e-06', 'ppl': '1.822', 'memory/max_active (GiB)': '50.46', 'memory/max_allocated (GiB)': '50.46', 'memory/device_reserved (GiB)': '71.27', 'tokens/train_per_sec_per_gpu': '251.7', 'tokens/trainable': 3562629, 'tokens/total': 7766154, 'epoch': '0.9652'}

 97%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉     | 254/263 [3:17:55<08:42, 58.06s/it]
                                                                                                                                                                                              
{'loss': '0.6639', 'grad_norm': '0.187', 'learning_rate': '8.773e-07', 'ppl': '1.942', 'memory/max_active (GiB)': '51.43', 'memory/max_allocated (GiB)': '51.43', 'memory/device_reserved (GiB)': '72.87', 'tokens/train_per_sec_per_gpu': '309.8', 'tokens/trainable': 3582377, 'tokens/total': 7807646, 'epoch': '0.969'}

 97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍    | 255/263 [3:18:36<07:04, 53.09s/it]
                                                                                                                                                                                              
{'loss': '0.7493', 'grad_norm': '0.2242', 'learning_rate': '7.108e-07', 'ppl': '2.116', 'memory/max_active (GiB)': '34.97', 'memory/max_allocated (GiB)': '34.97', 'memory/device_reserved (GiB)': '44.01', 'tokens/train_per_sec_per_gpu': '305.2', 'tokens/trainable': 3595039, 'tokens/total': 7833474, 'epoch': '0.9728'}

 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████    | 256/263 [3:19:23<05:58, 51.28s/it]
                                                                                                                                                                                              
{'loss': '0.6764', 'grad_norm': '0.2359', 'learning_rate': '5.618e-07', 'ppl': '1.967', 'memory/max_active (GiB)': '21.36', 'memory/max_allocated (GiB)': '21.36', 'memory/device_reserved (GiB)': '26.67', 'tokens/train_per_sec_per_gpu': '177.1', 'tokens/trainable': 3603378, 'tokens/total': 7853228, 'epoch': '0.9766'}

 98%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌   | 257/263 [3:20:16<05:10, 51.73s/it]
                                                                                                                                                                                              
{'loss': '0.5852', 'grad_norm': '0.1929', 'learning_rate': '4.302e-07', 'ppl': '1.795', 'memory/max_active (GiB)': '25.94', 'memory/max_allocated (GiB)': '25.94', 'memory/device_reserved (GiB)': '33.87', 'tokens/train_per_sec_per_gpu': '263.3', 'tokens/trainable': 3617272, 'tokens/total': 7882296, 'epoch': '0.9804'}

 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  | 258/263 [3:21:05<04:14, 50.88s/it]
                                                                                                                                                                                              
{'loss': '0.709', 'grad_norm': '0.2294', 'learning_rate': '3.161e-07', 'ppl': '2.032', 'memory/max_active (GiB)': '46.35', 'memory/max_allocated (GiB)': '46.35', 'memory/device_reserved (GiB)': '65.2', 'tokens/train_per_sec_per_gpu': '308.2', 'tokens/trainable': 3632340, 'tokens/total': 7915626, 'epoch': '0.9843'}

 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋  | 259/263 [3:21:34<02:57, 44.37s/it]
                                                                                                                                                                                              
{'loss': '0.5761', 'grad_norm': '0.2068', 'learning_rate': '2.196e-07', 'ppl': '1.779', 'memory/max_active (GiB)': '24.15', 'memory/max_allocated (GiB)': '24.15', 'memory/device_reserved (GiB)': '39.5', 'tokens/train_per_sec_per_gpu': '365.9', 'tokens/trainable': 3643011, 'tokens/total': 7937480, 'epoch': '0.9881'}

 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 260/263 [3:22:14<02:09, 43.19s/it]
                                                                                                                                                                                              
{'loss': '0.654', 'grad_norm': '0.1978', 'learning_rate': '1.405e-07', 'ppl': '1.923', 'memory/max_active (GiB)': '29.8', 'memory/max_allocated (GiB)': '29.8', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '340.3', 'tokens/trainable': 3656769, 'tokens/total': 7966668, 'epoch': '0.9919'}

 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 261/263 [3:23:10<01:34, 47.04s/it]
                                                                                                                                                                                              
{'loss': '0.6804', 'grad_norm': '0.2136', 'learning_rate': '7.906e-08', 'ppl': '1.975', 'memory/max_active (GiB)': '32.49', 'memory/max_allocated (GiB)': '32.49', 'memory/device_reserved (GiB)': '41.61', 'tokens/train_per_sec_per_gpu': '282.8', 'tokens/trainable': 3672618, 'tokens/total': 8001206, 'epoch': '0.9957'}

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 262/263 [3:24:03<00:48, 48.74s/it]
                                                                                                                                                                                              
{'loss': '0.5862', 'grad_norm': '0.2206', 'learning_rate': '3.514e-08', 'ppl': '1.797', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '40.98', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3683361, 'tokens/total': 8024868, 'epoch': '0.9995'}

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 263/263 [3:24:08<00:00, 35.60s/it]
                                                                                                                                                                                              
{'loss': '0.6559', 'grad_norm': '0.6503', 'learning_rate': '8.786e-09', 'ppl': '1.927', 'memory/max_active (GiB)': '17.39', 'memory/max_allocated (GiB)': '17.39', 'memory/device_reserved (GiB)': '32.52', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3684370, 'tokens/total': 8026968, 'epoch': '1'}

[2026-06-14 17:37:06,485] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B/checkpoint-263

                                                                                                                                                                                              
{'train_runtime': '1.225e+04', 'train_samples_per_second': '0.343', 'train_steps_per_second': '0.021', 'train_loss': '0.731', 'memory/max_active (GiB)': '11.01', 'memory/max_allocated (GiB)': '11.01', 'memory/device_reserved (GiB)': '20.05', 'epoch': '1'}

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 263/263 [3:24:11<00:00, 46.58s/it]
[2026-06-14 17:37:12,240] [INFO] [axolotl.train.save_trained_model:267] [PID:3393] Training completed! Saving trained model to ./outputs/Jacob-2-E4B.
[2026-06-14 17:37:12,857] [INFO] [axolotl.train.save_trained_model:388] [PID:3393] Model successfully saved to ./outputs/Jacob-2-E4B
[2026-06-14 17:37:13,576] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B

Processing Files (3 / 3)      : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  172MB /  172MB,  109MB/s  

New Data Upload               : |                                                                                                                                |  0.00B /  0.00B,  0.00B/s  

  ...b-2-E4B/training_args.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24.2kB / 24.2kB            

  ...adapter_model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  140MB /  140MB            

  ...acob-2-E4B/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2MB / 32.2MB