Instructions for using jacob-ml/Jacob-2-E4B with libraries (PEFT, Transformers) and local apps (vLLM, SGLang, Docker Model Runner).

Libraries

PEFT

How to use jacob-ml/Jacob-2-E4B with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")
model = PeftModel.from_pretrained(base_model, "jacob-ml/Jacob-2-E4B")

Transformers

How to use jacob-ml/Jacob-2-E4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("jacob-ml/Jacob-2-E4B")
model = AutoModelForMultimodalLM.from_pretrained("jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use jacob-ml/Jacob-2-E4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jacob-ml/Jacob-2-E4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
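
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field; `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="jacob-ml/Jacob-2-E4B",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the text of the first choice (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` issues the same call as the curl command. Passing `base_url="http://localhost:30000"` should target an SGLang server started on that port instead.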

Use Docker

docker model run hf.co/jacob-ml/Jacob-2-E4B

SGLang

How to use jacob-ml/Jacob-2-E4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jacob-ml/Jacob-2-E4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jacob-ml/Jacob-2-E4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "jacob-ml/Jacob-2-E4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use jacob-ml/Jacob-2-E4B with Docker Model Runner:
```
docker model run hf.co/jacob-ml/Jacob-2-E4B
```

Jacob-2-E4B / debug.log

mags0ft

End of training

43a2674 verified 15 days ago

410 kB

	[2026-06-14 14:08:51,250] [DEBUG] [axolotl.utils.config.resolve_dtype:74] [PID:3393] bf16 support detected, enabling for this configuration.
	[2026-06-14 14:08:51,433] [WARNING] [axolotl.utils.config.normalize_config:281] [PID:3393] Gemma4 requires use_reentrant=False for gradient checkpointing in distributed training. Setting use_reentrant=False.
	[2026-06-14 14:08:51,433] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:3393] baseline 0.000GB ()
	[2026-06-14 14:08:51,434] [INFO] [axolotl.cli.config.load_cfg:333] [PID:3393] config:
	{
	"activation_offloading": true,
	"adapter": "lora",
	"attn_implementation": "sdpa",
	"attn_needs_dtype_cast": false,
	"attn_supports_packing": false,
	"attn_uses_flash_lib": false,
	"axolotl_config_path": "./config.yaml",
	"base_model": "google/gemma-4-E4B-it",
	"base_model_config": "google/gemma-4-E4B-it",
	"batch_size": 16,
	"bf16": true,
	"capabilities": {
	"bf16": true,
	"compute_capability": "sm_80",
	"fp8": false,
	"n_gpu": 1,
	"n_node": 1,
	"tf32": true
	},
	"chat_template": "jinja",
	"chat_template_jinja": "./jinja",
	"context_parallel_size": 1,
	"cut_cross_entropy": true,
	"dataloader_num_workers": 1,
	"dataloader_pin_memory": true,
	"dataloader_prefetch_factor": 256,
	"dataset_num_proc": 31,
	"dataset_prepared_path": "./dataset-e4b",
	"datasets": [
	{
	"chat_template": "tokenizer_default",
	"field_messages": "messages",
	"field_tools": "tools",
	"message_property_mappings": {
	"content": "content",
	"role": "role"
	},
	"path": "jacob-ml/Jacob-2-SSFT-filtered",
	"split": "train",
	"trust_remote_code": false,
	"type": "chat_template"
	}
	],
	"ddp": false,
	"device": "cuda:0",
	"dion_rank_fraction": 1.0,
	"dion_rank_multiple_of": 1,
	"eaft_alpha": 1.0,
	"eaft_k": 20,
	"env_capabilities": {
	"torch_version": "2.10.0"
	},
	"eval_batch_size": 2,
	"eval_causal_lm_metrics": [
	"sacrebleu",
	"comet",
	"ter",
	"chrf"
	],
	"eval_max_new_tokens": 128,
	"eval_table_size": 0,
	"experimental_skip_move_to_device": true,
	"fp16": false,
	"freeze_mm_modules": true,
	"generate_samples": false,
	"generation_do_sample": true,
	"generation_max_new_tokens": 50,
	"generation_prompt_ratio": 0.5,
	"generation_temperature": 0.7,
	"gradient_accumulation_steps": 8,
	"gradient_checkpointing": true,
	"gradient_checkpointing_kwargs": {
	"use_reentrant": false
	},
	"hub_model_id": "jacob-ml/Jacob-2-E4B",
	"include_tkps": true,
	"is_multimodal": true,
	"layer_offloading": true,
	"learning_rate": 0.0002,
	"lisa_layers_attribute": "model.layers",
	"load_best_model_at_end": false,
	"load_in_4bit": false,
	"load_in_8bit": true,
	"local_rank": 0,
	"logging_steps": 1,
	"lora_alpha": 16,
	"lora_dropout": 0.0,
	"lora_r": 16,
	"lora_target_modules": "model.language_model.layers.[\\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj",
	"loraplus_lr_embedding": 1e-06,
	"lr_scheduler": "cosine",
	"mean_resizing_embeddings": false,
	"merge_method": "memory_efficient",
	"micro_batch_size": 2,
	"model_config_type": "gemma4",
	"model_config_type_text": "gemma4_text",
	"num_epochs": 1.0,
	"num_generation_samples": 3,
	"optimizer": "adamw_torch_8bit",
	"otel_metrics_host": "localhost",
	"otel_metrics_port": 8000,
	"output_dir": "./outputs/Jacob-2-E4B",
	"pad_to_sequence_len": false,
	"plugins": [
	"axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin"
	],
	"pretrain_multipack_attn": true,
	"processor_config": "google/gemma-4-E4B-it",
	"profiler_steps_start": 0,
	"qgalore_cos_threshold": 0.4,
	"qgalore_gamma_proj": 2,
	"qgalore_proj_bits": 4,
	"qgalore_proj_group_size": 256,
	"qgalore_proj_quant": true,
	"qgalore_proj_type": "std",
	"qgalore_queue_size": 5,
	"qgalore_rank": 256,
	"qgalore_scale": 0.25,
	"qgalore_update_proj_gap": 200,
	"qlora_sharded_model_loading": false,
	"quantize_moe_experts": false,
	"ray_num_workers": 1,
	"relora_prune_method": "magnitude",
	"resources_per_worker": {
	"GPU": 1
	},
	"sample_packing": false,
	"sample_packing_bin_size": 200,
	"sample_packing_group_size": 100000,
	"save_only_model": false,
	"save_safetensors": true,
	"sequence_len": 8192,
	"shuffle_before_merging_datasets": false,
	"shuffle_merged_datasets": true,
	"skip_prepare_dataset": false,
	"streaming_multipack_buffer_size": 10000,
	"strict": false,
	"tensor_parallel_size": 1,
	"tf32": false,
	"tiled_mlp_use_original_mlp": true,
	"tokenizer_config": "google/gemma-4-E4B-it",
	"tokenizer_save_jinja_files": true,
	"torch_dtype": "torch.bfloat16",
	"train_on_inputs": false,
	"trl": {
	"async_prefetch": false,
	"log_completions": false,
	"mask_truncated_completions": false,
	"ref_model_mixup_alpha": 0.9,
	"ref_model_sync_steps": 64,
	"replay_buffer_size": 0,
	"replay_recompute_logps": true,
	"reroll_max_groups": 1,
	"reroll_start_fraction": 1.0,
	"reward_num_workers": 1,
	"scale_rewards": true,
	"skip_zero_advantage_batches": true,
	"sync_ref_model": false,
	"use_data_producer": false,
	"use_vllm": false,
	"vllm_lora_sync": false,
	"vllm_server_host": "0.0.0.0",
	"vllm_server_port": 8000
	},
	"use_otel_metrics": false,
	"use_ray": false,
	"val_set_size": 0.0,
	"vllm": {
	"device": "auto",
	"dtype": "auto",
	"gpu_memory_utilization": 0.9,
	"host": "0.0.0.0",
	"port": 8000
	},
	"warmup_ratio": 0.1,
	"weight_decay": 0.0,
	"world_size": 1
	}
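
To see which modules the `lora_target_modules` pattern in the config above selects, it can be tried against a few module names. This is a sketch under two assumptions: the pattern is written with plain `|` alternation, and it is applied as a full match against each module name (the behaviour PEFT documents for a string-valued `target_modules`). The module names in `candidates` are hypothetical examples, not a dump of the real model.

```python
import re

# Pattern copied from "lora_target_modules"; the JSON "\\d" is the regex "\d".
LORA_TARGET_PATTERN = (
    r"model.language_model.layers.[\d]+."
    r"(_checkpoint_wrapped_module.)?(mlp|self_attn)."
    r"(up|down|gate|q|k|v|o)_proj"
)


def is_lora_target(module_name):
    """Return True when the whole module name matches the pattern."""
    return re.fullmatch(LORA_TARGET_PATTERN, module_name) is not None


# Hypothetical module names, for illustration only.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.17.mlp.down_proj",
    "model.language_model.layers.3._checkpoint_wrapped_module.mlp.gate_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_norm",
]
for name in candidates:
    print(f"{name}: {is_lora_target(name)}")
```

Only projection layers under `model.language_model` match, which is consistent with `"freeze_mm_modules": true`: names under other prefixes, such as the hypothetical vision tower entry, are not adapted.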
	[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos>
	[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos>
	[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad>
	[2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk>
	[2026-06-14 14:08:55,227] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:482] [PID:3393] Unable to find prepared dataset in dataset-e4b/226f5539ba5a2355ba6a34bd68b2a326
	[2026-06-14 14:08:55,228] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:3393] Loading raw datasets...
	[2026-06-14 14:08:55,228] [WARNING] [axolotl.utils.data.sft._load_raw_datasets:322] [PID:3393] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset using `axolotl preprocess path/to/config.yml`.
	Downloading (incomplete total...): 0.00B [00:00, ?B/s]
	Fetching 0 files: 0it [00:00, ?it/s]
	Download complete: : 0.00B [00:00, ?B/s]
	[2026-06-14 14:08:56,312] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:3393] Loading dataset: jacob-ml/Jacob-2-SSFT-filtered with base_type: chat_template and prompt_style: None
	[2026-06-14 14:08:56,315] [INFO] [axolotl.prompt_strategies.chat_template.__call__:1209] [PID:3393] Using chat template:
	---
	{%- macro format_parameters(properties, required, filter_keys=false) -%}
	{%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%}
	{%- set ns = namespace(found_first=false) -%}
	{%- for key, value in properties | dictsort -%}
	{%- set add_comma = false -%}
	{%- if not filter_keys or key not in standard_keys -%}
	{%- if ns.found_first %},{% endif -%}
	{%- set ns.found_first = true -%}
	{{ key }}:{
	{%- if value['description'] -%}
	description:<|"|>{{ value['description'] }}<|"|>
	{%- set add_comma = true -%}
	{%- endif -%}
	{%- if value['type'] | upper == 'STRING' -%}
	{%- if value['enum'] -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	enum:{{ format_argument(value['enum']) }}
	{%- endif -%}
	{%- elif value['type'] | upper == 'ARRAY' -%}
	{%- if value['items'] is mapping and value['items'] -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	items:{
	{%- set ns_items = namespace(found_first=false) -%}
	{%- for item_key, item_value in value['items'] | dictsort -%}
	{%- if item_value is not none -%}
	{%- if ns_items.found_first %},{% endif -%}
	{%- set ns_items.found_first = true -%}
	{%- if item_key == 'properties' -%}
	properties:{
	{%- if item_value is mapping -%}
	{{- format_parameters(item_value, value['items']['required'] | default([])) -}}
	{%- endif -%}
	}
	{%- elif item_key == 'required' -%}
	required:[
	{%- for req_item in item_value -%}
	<|"|>{{- req_item -}}<|"|>
	{%- if not loop.last %},{% endif -%}
	{%- endfor -%}
	]
	{%- elif item_key == 'type' -%}
	{%- if item_value is string -%}
	type:{{ format_argument(item_value | upper) }}
	{%- else -%}
	type:{{ format_argument(item_value | map('upper') | list) }}
	{%- endif -%}
	{%- else -%}
	{{ item_key }}:{{ format_argument(item_value) }}
	{%- endif -%}
	{%- endif -%}
	{%- endfor -%}
	}
	{%- endif -%}
	{%- endif -%}
	{%- if value['nullable'] %}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	nullable:true
	{%- endif -%}
	{%- if value['type'] | upper == 'OBJECT' -%}
	{%- if value['properties'] is defined and value['properties'] is mapping -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	properties:{
	{{- format_parameters(value['properties'], value['required'] | default([])) -}}
	}
	{%- elif value is mapping -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	properties:{
	{{- format_parameters(value, value['required'] | default([]), filter_keys=true) -}}
	}
	{%- endif -%}
	{%- if value['required'] -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	required:[
	{%- for item in value['required'] | default([]) -%}
	<|"|>{{- item -}}<|"|>
	{%- if not loop.last %},{% endif -%}
	{%- endfor -%}
	]
	{%- endif -%}
	{%- endif -%}
	{%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%}
	type:<|"|>{{ value['type'] | upper }}<|"|>}
	{%- endif -%}
	{%- endfor -%}
	{%- endmacro -%}
	{%- macro format_function_declaration(tool_data) -%}
	declaration:{{- tool_data['function']['name'] -}}{description:<|"|>{{- tool_data['function']['description'] -}}<|"|>
	{%- set params = tool_data['function']['parameters'] -%}
	{%- if params -%}
	,parameters:{
	{%- if params['properties'] -%}
	properties:{ {{- format_parameters(params['properties'], params['required']) -}} },
	{%- endif -%}
	{%- if params['required'] -%}
	required:[
	{%- for item in params['required'] -%}
	<|"|>{{- item -}}<|"|>
	{{- ',' if not loop.last -}}
	{%- endfor -%}
	],
	{%- endif -%}
	{%- if params['type'] -%}
	type:<|"|>{{- params['type'] | upper -}}<|"|>}
	{%- endif -%}
	{%- endif -%}
	{%- if 'response' in tool_data['function'] -%}
	{%- set response_declaration = tool_data['function']['response'] -%}
	,response:{
	{%- if response_declaration['description'] -%}
	description:<|"|>{{- response_declaration['description'] -}}<|"|>,
	{%- endif -%}
	{%- if response_declaration['type'] | upper == 'OBJECT' -%}
	type:<|"|>{{- response_declaration['type'] | upper -}}<|"|>}
	{%- endif -%}
	{%- endif -%}
	}
	{%- endmacro -%}
	{%- macro format_argument(argument, escape_keys=True) -%}
	{%- if argument is string -%}
	{{- '<|"|>' + argument + '<|"|>' -}}
	{%- elif argument is boolean -%}
	{{- 'true' if argument else 'false' -}}
	{%- elif argument is mapping -%}
	{{- '{' -}}
	{%- set ns = namespace(found_first=false) -%}
	{%- for key, value in argument | dictsort -%}
	{%- if ns.found_first %},{% endif -%}
	{%- set ns.found_first = true -%}
	{%- if escape_keys -%}
	{{- '<|"|>' + key + '<|"|>' -}}
	{%- else -%}
	{{- key -}}
	{%- endif -%}
	:{{- format_argument(value, escape_keys=escape_keys) -}}
	{%- endfor -%}
	{{- '}' -}}
	{%- elif argument is sequence -%}
	{{- '[' -}}
	{%- for item in argument -%}
	{{- format_argument(item, escape_keys=escape_keys) -}}
	{%- if not loop.last %},{% endif -%}
	{%- endfor -%}
	{{- ']' -}}
	{%- else -%}
	{{- argument -}}
	{%- endif -%}
	{%- endmacro -%}
	{%- macro strip_thinking(text) -%}
	{%- set ns = namespace(result='') -%}
	{%- for part in text.split('<channel|>') -%}
	{%- if '<|channel>' in part -%}
	{%- set ns.result = ns.result + part.split('<|channel>')[0] -%}
	{%- else -%}
	{%- set ns.result = ns.result + part -%}
	{%- endif -%}
	{%- endfor -%}
	{{- ns.result | trim -}}
	{%- endmacro -%}

	{%- macro format_tool_response_block(tool_name, response) -%}
	{{- '<|tool_response>' -}}
	{%- if response is mapping -%}
	{{- 'response:' + tool_name + '{' -}}
	{%- for key, value in response | dictsort -%}
	{{- key -}}:{{- format_argument(value, escape_keys=False) -}}
	{%- if not loop.last %},{% endif -%}
	{%- endfor -%}
	{{- '}' -}}
	{%- else -%}
	{{- 'response:' + tool_name + '{value:' + format_argument(response, escape_keys=False) + '}' -}}
	{%- endif -%}
	{{- '<tool_response|>' -}}
	{%- endmacro -%}

	{%- set ns = namespace(prev_message_type=None) -%}
	{%- set loop_messages = messages -%}
	{{- bos_token -}}
	{#- Handle System/Tool Definitions Block -#}
	{%- if (enable_thinking is defined and enable_thinking) or tools or messages[0]['role'] in ['system', 'developer'] -%}
	{{- '<|turn>system\n' -}}
	{#- Inject Thinking token at the very top of the FIRST system turn -#}
	{%- if enable_thinking is defined and enable_thinking -%}
	{{- '<|think|>\n' -}}
	{%- set ns.prev_message_type = 'think' -%}
	{%- endif -%}
	{%- if messages[0]['role'] in ['system', 'developer'] -%}
	{%- if messages[0]['content'] is string -%}
	{{- messages[0]['content'] | trim -}}
	{%- elif messages[0]['content'] is sequence -%}
	{%- for item in messages[0]['content'] -%}
	{{- item['text'] | trim + ' '-}}
	{%- endfor -%}
	{%- endif -%}
	{%- set loop_messages = messages[1:] -%}
	{%- endif -%}
	{%- if tools -%}
	{%- for tool in tools %}
	{{- '<|tool>' -}}
	{{- format_function_declaration(tool) | trim -}}
	{{- '<tool|>' -}}
	{%- endfor %}
	{%- set ns.prev_message_type = 'tool' -%}
	{%- endif -%}
	{{- '<turn|>\n' -}}
	{%- endif %}

	{#- Pre-scan: find last user message index for reasoning guard -#}
	{%- set ns_turn = namespace(last_user_idx=-1) -%}
	{%- for i in range(loop_messages | length) -%}
	{%- if loop_messages[i]['role'] == 'user' -%}
	{%- set ns_turn.last_user_idx = i -%}
	{%- endif -%}
	{%- endfor -%}

	{#- Loop through messages -#}
	{%- for message in loop_messages -%}
	{%- if message['role'] != 'tool' -%}
	{%- set ns.prev_message_type = None -%}
	{%- set role = 'model' if message['role'] == 'assistant' else message['role'] -%}
	{%- if message['role'] == 'tool' and 'name' in message -%}
	{%- set _tool_name = message['name'] -%}
	{%- endif -%}
	{#- Detect continuation: suppress duplicate <|turn>model when previous non-tool message was also assistant -#}
	{%- set prev_nt = namespace(role=None, found=false) -%}
	{%- if loop.index0 > 0 -%}
	{%- for j in range(loop.index0 - 1, -1, -1) -%}
	{%- if not prev_nt.found -%}
	{%- if loop_messages[j]['role'] != 'tool' -%}
	{%- set prev_nt.role = loop_messages[j]['role'] -%}
	{%- set prev_nt.found = true -%}
	{%- endif -%}
	{%- endif -%}
	{%- endfor -%}
	{%- endif -%}
	{%- set continue_same_model_turn = (role == 'model' and prev_nt.role == 'assistant') -%}
	{%- if not continue_same_model_turn -%}
	{{- '<|turn>' + role + '\n' }}
	{%- endif -%}

	{#- Render reasoning/reasoning_content as thinking channel -#}
	{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
	{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
	{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
	{%- endif -%}

	{%- if message['tool_calls'] -%}
	{%- for tool_call in message['tool_calls'] -%}
	{%- set function = tool_call['function'] -%}
	{{- '<|tool_call>call:' + function['name'] + '{' -}}
	{%- if function['arguments'] is mapping -%}
	{%- set ns_args = namespace(found_first=false) -%}
	{%- for key, value in function['arguments'] | dictsort -%}
	{%- if ns_args.found_first %},{% endif -%}
	{%- set ns_args.found_first = true -%}
	{{- key -}}:{{- format_argument(value, escape_keys=False) -}}
	{%- endfor -%}
	{%- elif function['arguments'] is string -%}
	{{- function['arguments'] -}}
	{%- endif -%}
	{{- '}<tool_call|>' -}}
	{%- endfor -%}
	{%- set ns.prev_message_type = 'tool_call' -%}
	{%- endif -%}

	{%- set ns_tr_out = namespace(flag=false) -%}
	{%- if message.get('tool_responses') -%}
	{#- Legacy: tool_responses embedded on the assistant message (Google/Gemma native) -#}
	{%- for tool_response in message['tool_responses'] -%}
	{{- format_tool_response_block(tool_response['name'] | default('unknown', true), tool_response['response']) -}}
	{%- set ns_tr_out.flag = true -%}
	{%- set ns.prev_message_type = 'tool_response' -%}
	{%- endfor -%}
	{%- elif message.get('tool_calls') -%}
	{#- OpenAI Chat Completions: forward-scan consecutive role:tool messages -#}
	{%- set ns_tool_scan = namespace(stopped=false) -%}
	{%- for k in range(loop.index0 + 1, loop_messages | length) -%}
	{%- if ns_tool_scan.stopped -%}
	{%- elif loop_messages[k]['role'] != 'tool' -%}
	{%- set ns_tool_scan.stopped = true -%}
	{%- else -%}
	{%- set follow = loop_messages[k] -%}
	{#- Resolve tool_call_id to function name -#}
	{%- set ns_tname = namespace(name=follow['name'] | default('unknown', true)) -%}
	{%- for tc in message['tool_calls'] -%}
	{%- if tc.get('id') == follow.get('tool_call_id') -%}
	{%- set ns_tname.name = tc['function']['name'] -%}
	{%- endif -%}
	{%- endfor -%}
	{#- Handle content as string or content-parts array -#}
	{%- set tool_body = follow.get('content') -%}
	{%- if tool_body is string -%}
	{{- format_tool_response_block(ns_tname.name, tool_body) -}}
	{%- elif tool_body is sequence and tool_body is not string -%}
	{%- set ns_txt = namespace(s='') -%}
	{%- for part in tool_body -%}
	{%- if part.get('type') == 'text' -%}
	{%- set ns_txt.s = ns_txt.s + (part.get('text') | default('')) -%}
	{%- endif -%}
	{%- endfor -%}
	{{- format_tool_response_block(ns_tname.name, ns_txt.s) -}}
	{%- for part in tool_body -%}
	{%- if part.get('type') == 'image' -%}
	{{- '<|image|>' -}}
	{%- elif part.get('type') == 'audio' -%}
	{{- '<|audio|>' -}}
	{%- elif part.get('type') == 'video' -%}
	{{- '<|video|>' -}}
	{%- endif -%}
	{%- endfor -%}
	{%- else -%}
	{{- format_tool_response_block(ns_tname.name, tool_body) -}}
	{%- endif -%}
	{%- set ns_tr_out.flag = true -%}
	{%- set ns.prev_message_type = 'tool_response' -%}
	{%- endif -%}
	{%- endfor -%}
	{%- endif -%}

	{%- set captured_content -%}
	{%- if message['content'] is string -%}
	{%- if role == 'model' -%}
	{{- strip_thinking(message['content']) -}}
	{%- else -%}
	{{- message['content'] | trim -}}
	{%- endif -%}
	{%- elif message['content'] is sequence -%}
	{%- for item in message['content'] -%}
	{%- if item['type'] == 'text' -%}
	{%- if role == 'model' -%}
	{{- strip_thinking(item['text']) -}}
	{%- else -%}
	{{- item['text'] | trim -}}
	{%- endif -%}
	{%- elif item['type'] == 'image' -%}
	{{- '<|image|>' -}}
	{%- set ns.prev_message_type = 'image' -%}
	{%- elif item['type'] == 'audio' -%}
	{{- '<|audio|>' -}}
	{%- set ns.prev_message_type = 'audio' -%}
	{%- elif item['type'] == 'video' -%}
	{{- '<|video|>' -}}
	{%- set ns.prev_message_type = 'video' -%}
	{%- endif -%}
	{%- endfor -%}
	{%- endif -%}
	{%- endset -%}

	{{- captured_content -}}
	{%- set has_content = captured_content | trim | length > 0 -%}

	{%- if ns.prev_message_type == 'tool_call' and not ns_tr_out.flag -%}
	{{- '<|tool_response>' -}}
	{%- elif not (ns_tr_out.flag and not has_content) -%}
	{{- '<turn|>\n' -}}
	{%- endif -%}
	{%- endif -%}
	{%- endfor -%}

	{%- if add_generation_prompt -%}
	{%- if ns.prev_message_type != 'tool_response' and ns.prev_message_type != 'tool_call' -%}
	{{- '<|turn>model\n' -}}
	{%- endif -%}
	{%- endif -%}

	---
	[2026-06-14 14:08:56,434] [WARNING] [axolotl.prompt_strategies.chat_template._validate_eot_and_eos_tokens:357] [PID:3393] EOS token '<eos>' not found in chat_template. Please check if your template/EOS token is correct.
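
The `strip_thinking` macro in the template above removes thought-channel spans from earlier model turns before they are re-rendered. A rough Python equivalent is sketched below for illustration, assuming the channel markers are the literal strings `<|channel>` and `<channel|>`. It mirrors the macro's split-and-rejoin logic and is not code taken from Axolotl or Transformers.

```python
def strip_thinking(text, open_tag="<|channel>", close_tag="<channel|>"):
    """Drop everything from an opening channel marker up to its closing marker."""
    result = ""
    # Splitting on the closing marker leaves each reasoning span at the tail
    # of a part, starting at the opening marker.
    for part in text.split(close_tag):
        if open_tag in part:
            # Keep only the text that precedes the opening marker.
            result += part.split(open_tag)[0]
        else:
            result += part
    # The macro pipes its result through Jinja's `trim` filter.
    return result.strip()


print(strip_thinking("<|channel>thought\nCheck the map.\n<channel|>Paris."))
```

Text without any channel markers passes through unchanged apart from the trimming of leading and trailing whitespace.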
	Tokenizing Prompts (num_proc=31): 100%|██████████| 4209/4209 [02:19<00:00, 30.16 examples/s]
	[2026-06-14 14:11:56,198] [INFO] [axolotl.utils.data.utils._log_dataset_stats:212] [PID:3393] min_input_len: 320
	[2026-06-14 14:11:56,199] [INFO] [axolotl.utils.data.utils._log_dataset_stats:213] [PID:3393] max_input_len: 23372
	Dropping Invalid Sequences (<None or >8192) (num_proc=31): 100%|██████████| 4209/4209 [00:02<00:00, 1595.10 examples/s]
	[2026-06-14 14:11:58,919] [INFO] [axolotl.utils.data.utils._drop_outside_range:306] [PID:3393] Dropped 15 sequences outside valid range ([None, 8192])
	Saving the dataset (16/16 shards): 100%|██████████| 4194/4194 [00:15<00:00, 270.70 examples/s]
	[2026-06-14 14:12:15,030] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:420] [PID:3393] total_num_tokens: 6_012_949
	[2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:438] [PID:3393] `total_supervised_tokens: 3_764_690`
	[2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:521] [PID:3393] total_num_steps: 263
	[2026-06-14 14:12:15,143] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:121] [PID:3393] Maximum number of steps set at 263
	[2026-06-14 14:12:15,405] [DEBUG] [axolotl.train.setup_model_and_tokenizer:70] [PID:3393] loading tokenizer... google/gemma-4-E4B-it
	[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos>
	[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos>
	[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad>
	[2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk>
	[2026-06-14 14:12:24,886] [DEBUG] [axolotl.train.setup_model_and_tokenizer:81] [PID:3393] Loading model
	[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:75] [PID:3393] Patched OptimState8bit for torch.compile compatibility
	[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:122] [PID:3393] Patched OptimState4bit for torch.compile compatibility
	[2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:154] [PID:3393] Patched OptimStateFp8 for torch.compile compatibility
	[2026-06-14 14:12:24,936] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:94] [PID:3393] Patched Trainer.evaluation_loop with nanmean loss calculation
	[2026-06-14 14:12:24,937] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:148] [PID:3393] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation
	[2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:207] [PID:3393] Patched Gemma4TextAttention.forward with fused RMSNorm+RoPE Triton kernels
	[2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:211] [PID:3393] Installed Gemma4 shared_kv_states side channel (PR #3611)
	[2026-06-14 14:12:25,062] [INFO] [axolotl.integrations.cut_cross_entropy.pre_model_load:94] [PID:3393] Applying Cut Cross Entropy to model type: gemma4
	Loading weights: 100%|██████████| 2076/2076 [00:25<00:00, 81.97it/s]
	[2026-06-14 14:12:52,761] [INFO] [axolotl.loaders.model._prepare_model_for_quantization:977] [PID:3393] converting PEFT model w/ prepare_model_for_kbit_training
	[2026-06-14 14:12:52,780] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:433] [PID:3393] Converting modules to torch.bfloat16
	[2026-06-14 14:12:52,965] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] Memory usage after model load 22.419GB (+22.419GB allocated, +28.939GB reserved)
	trainable params: 34,881,536 || all params: 7,975,982,368 || trainable%: 0.4373
	[2026-06-14 14:12:53,476] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] after adapters 10.827GB (+10.827GB allocated, +29.068GB reserved)
	[2026-06-14 14:12:55,145] [INFO] [axolotl.utils.freeze.freeze_mm_modules:49] [PID:3393] freeze_mm_modules: froze 0 vision/audio parameters
	[2026-06-14 14:12:56,196] [INFO] [axolotl.core.trainers.mixins.layer_offloading.__init__:291] [PID:3393] Layer parameter offloading enabled
	[2026-06-14 14:12:56,197] [WARNING] [axolotl.core.trainers.mixins.layer_offloading.__init__:73] [PID:3393] LayerOffloadManager: no decoder layers found, offloading disabled
	[2026-06-14 14:12:56,197] [INFO] [axolotl.train.save_initial_configs:450] [PID:3393] Pre-saving adapter config to ./outputs/Jacob-2-E4B...
	[2026-06-14 14:12:56,198] [INFO] [axolotl.train.save_initial_configs:454] [PID:3393] Pre-saving tokenizer to ./outputs/Jacob-2-E4B...
	[2026-06-14 14:12:56,696] [INFO] [axolotl.train.save_initial_configs:459] [PID:3393] Pre-saving model config to ./outputs/Jacob-2-E4B...
	[2026-06-14 14:12:56,700] [INFO] [axolotl.train.save_initial_configs:463] [PID:3393] Pre-saving processor to ./outputs/Jacob-2-E4B...
	[2026-06-14 14:12:57,113] [INFO] [axolotl.train.execute_training:226] [PID:3393] Starting trainer...
	0%| | 0/263 [00:00<?, ?it/s]
	[2026-06-14 14:13:00,854] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:13:01,453] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:13:11,380] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:13:46,333] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:13:46,745] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:14:09,129] [INFO] [axolotl.kernels.autotune_telemetry.on_step_end:133] [PID:3393] Reported 2 fused-rope kernel autotune config(s) to telemetry.
	0%|▌ | 1/263 [01:11<5:11:11, 71.27s/it] {'loss': '1.576', 'grad_norm': '0.5319', 'learning_rate': '0', 'ppl': '4.836', 'memory/max_active (GiB)': '35.37', 'memory/max_allocated (GiB)': '35.37', 'memory/device_reserved (GiB)': '44.53', 'tokens/trainable': 13646, 'tokens/total': 34320, 'epoch': '0.003815'}
	1%|█▏ | 2/263 [02:07<4:31:48, 62.49s/it] {'loss': '1.541', 'grad_norm': '0.4915', 'learning_rate': '7.692e-06', 'ppl': '4.671', 'memory/max_active (GiB)': '37.15', 'memory/max_allocated (GiB)': '37.15', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '272.8', 'tokens/trainable': 29007, 'tokens/total': 65680, 'epoch': '0.00763'}
	1%|█▋ | 3/263 [02:52<3:55:45, 54.40s/it] {'loss': '1.676', 'grad_norm': '0.5334', 'learning_rate': '1.538e-05', 'ppl': '5.345', 'memory/max_active (GiB)': '26.41', 'memory/max_allocated (GiB)': '26.41', 'memory/device_reserved (GiB)': '34.21', 'tokens/train_per_sec_per_gpu': '241.3', 'tokens/trainable': 39814, 'tokens/total': 89258, 'epoch': '0.01144'}
	2%|██▎ | 4/263 [03:26<3:20:53, 46.54s/it] {'loss': '1.701', 'grad_norm': '0.6123', 'learning_rate': '2.308e-05', 'ppl': '5.478', 'memory/max_active (GiB)': '22.96', 'memory/max_allocated (GiB)': '22.96', 'memory/device_reserved (GiB)': '25.77', 'tokens/train_per_sec_per_gpu': '256.6', 'tokens/trainable': 48660, 'tokens/total': 108024, 'epoch': '0.01526'}
	2%|██▊ | 5/263 [04:08<3:12:35, 44.79s/it] {'loss': '1.721', 'grad_norm': '0.6678', 'learning_rate': '3.077e-05', 'ppl': '5.59', 'memory/max_active (GiB)': '24.42', 'memory/max_allocated (GiB)': '24.42', 'memory/device_reserved (GiB)': '31.26', 'tokens/train_per_sec_per_gpu': '200.1', 'tokens/trainable': 57004, 'tokens/total': 127360, 'epoch': '0.01907'}
	2%|███▍ | 6/263 [05:00<3:22:41, 47.32s/it] {'loss': '1.441', 'grad_norm': '0.5136', 'learning_rate': '3.846e-05', 'ppl': '4.225', 'memory/max_active (GiB)': '43.28', 'memory/max_allocated (GiB)': '43.28', 'memory/device_reserved (GiB)': '60.49', 'tokens/train_per_sec_per_gpu': '340.5', 'tokens/trainable': 74786, 'tokens/total': 165128, 'epoch': '0.02289'}
	3%|████ | 7/263 [05:43<3:15:55, 45.92s/it] {'loss': '1.527', 'grad_norm': '0.5568', 'learning_rate': '4.615e-05', 'ppl': '4.603', 'memory/max_active (GiB)': '38.81', 'memory/max_allocated (GiB)': '38.81', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '290.1', 'tokens/trainable': 87272, 'tokens/total': 194790, 'epoch': '0.0267'}
	3%|████▌ | 8/263 [06:32<3:19:30, 46.94s/it] {'loss': '1.629', 'grad_norm': '0.7715', 'learning_rate': '5.385e-05', 'ppl': '5.098', 'memory/max_active (GiB)': '34.9', 'memory/max_allocated (GiB)': '34.9', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '247.6', 'tokens/trainable': 99438, 'tokens/total': 223594, 'epoch': '0.03052'}
	3%|█████▏ | 9/263 [07:20<3:19:53, 47.22s/it] {'loss': '1.545', 'grad_norm': '0.7755', 'learning_rate': '6.154e-05', 'ppl': '4.688', 'memory/max_active (GiB)': '42.92', 'memory/max_allocated (GiB)': '42.92', 'memory/device_reserved (GiB)': '59.79', 'tokens/train_per_sec_per_gpu': '298.1', 'tokens/trainable': 113695, 'tokens/total': 256322, 'epoch': '0.03433'}
	4%|█████▋ | 10/263 [08:10<3:21:47, 47.86s/it] {'loss': '1.348', 'grad_norm': '0.6723', 'learning_rate': '6.923e-05', 'ppl': '3.848', 'memory/max_active (GiB)': '28.92', 'memory/max_allocated (GiB)': '28.92', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '238.3', 'tokens/trainable': 125442, 'tokens/total': 282552, 'epoch': '0.03815'}
	4%|██████▎ | 11/263 [08:50<3:10:58, 45.47s/it] {'loss': '1.337', 'grad_norm': '0.701', 'learning_rate': '7.692e-05', 'ppl': '3.808', 'memory/max_active (GiB)': '25.92', 'memory/max_allocated (GiB)': '25.92', 'memory/device_reserved (GiB)': '33.79', 'tokens/train_per_sec_per_gpu': '275.6', 'tokens/trainable': 136483, 'tokens/total': 306800, 'epoch': '0.04196'}
	5%|██████▊ | 12/263 [09:22<2:53:32, 41.48s/it] {'loss': '1.188', 'grad_norm': '0.5657', 'learning_rate': '8.462e-05', 'ppl': '3.28', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.74', 'tokens/train_per_sec_per_gpu': '364.8', 'tokens/trainable': 148289, 'tokens/total': 329988, 'epoch': '0.04578'}
	5%|███████▍ | 13/263 [10:18<3:11:41, 46.01s/it] {'loss': '1.115', 'grad_norm': '0.5181', 'learning_rate': '9.231e-05', 'ppl': '3.048', 'memory/max_active (GiB)': '48.15', 'memory/max_allocated (GiB)': '48.15', 'memory/device_reserved (GiB)': '67.89', 'tokens/train_per_sec_per_gpu': '312.6', 'tokens/trainable': 165922, 'tokens/total': 365046, 'epoch': '0.04959'}
	5%|███████▉ | 14/263 [11:08<3:15:29, 47.11s/it] {'loss': '1.097', 'grad_norm': '0.5891', 'learning_rate': '0.0001', 'ppl': '2.995', 'memory/max_active (GiB)': '37.46', 'memory/max_allocated (GiB)': '37.46', 'memory/device_reserved (GiB)': '47.39', 'tokens/train_per_sec_per_gpu': '291.3', 'tokens/trainable': 180382, 'tokens/total': 398168, 'epoch': '0.05341'}
	6%|████████▌ | 15/263 [11:59<3:19:12, 48.19s/it] {'loss': '1.005', 'grad_norm': '0.5228', 'learning_rate': '0.0001077', 'ppl': '2.732', 'memory/max_active (GiB)': '39.68', 'memory/max_allocated (GiB)': '39.68', 'memory/device_reserved (GiB)': '55.06', 'tokens/train_per_sec_per_gpu': '245.6', 'tokens/trainable': 192840, 'tokens/total': 425746, 'epoch': '0.05722'}
	6%|█████████▏ | 16/263 [12:52<3:24:35, 49.70s/it] {'loss': '0.871', 'grad_norm': '0.36', 'learning_rate': '0.0001154', 'ppl': '2.389', 'memory/max_active (GiB)': '50.71', 'memory/max_allocated (GiB)': '50.71', 'memory/device_reserved (GiB)': '71.7', 'tokens/train_per_sec_per_gpu': '322.6', 'tokens/trainable': 209997, 'tokens/total': 464790, 'epoch': '0.06104'}
	6%|█████████▋ | 17/263 [13:45<3:27:34, 50.63s/it] {'loss': '0.98', 'grad_norm': '0.5478', 'learning_rate': '0.0001231', 'ppl': '2.664', 'memory/max_active (GiB)': '59.65', 'memory/max_allocated (GiB)': '59.65', 'memory/device_reserved (GiB)': '77.29', 'tokens/train_per_sec_per_gpu': '331.7', 'tokens/trainable': 227508, 'tokens/total': 502040, 'epoch': '0.06485'}
	7%|██████████▎ | 18/263 [14:33<3:24:00, 49.96s/it] {'loss': '1.008', 'grad_norm': '0.3831', 'learning_rate': '0.0001308', 'ppl': '2.74', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '52.24', 'tokens/train_per_sec_per_gpu': '315.6', 'tokens/trainable': 242788, 'tokens/total': 533846, 'epoch': '0.06867'}
	7%|██████████▊ | 19/263 [15:13<3:10:30, 46.85s/it] {'loss': '0.9226', 'grad_norm': '0.4333', 'learning_rate': '0.0001385', 'ppl': '2.516', 'memory/max_active (GiB)': '41.82', 'memory/max_allocated (GiB)': '41.82', 'memory/device_reserved (GiB)': '58.38', 'tokens/train_per_sec_per_gpu': '382.9', 'tokens/trainable': 257947, 'tokens/total': 566334, 'epoch': '0.07248'}
	[2026-06-14 14:28:28,050] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	[2026-06-14 14:28:28,323] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
	return fn(*args, **kwargs)

	8%|███████████▍ | 20/263 [16:11<3:23:38, 50.28s/it] {'loss': '0.9231', 'grad_norm': '0.4565', 'learning_rate': '0.0001462', 'ppl': '2.517', 'memory/max_active (GiB)': '37.11', 'memory/max_allocated (GiB)': '37.11', 'memory/device_reserved (GiB)': '46.94', 'tokens/train_per_sec_per_gpu': '361.3', 'tokens/trainable': 279009, 'tokens/total': 610288, 'epoch': '0.0763'}
	8%|███████████▉ | 21/263 [17:02<3:23:17, 50.40s/it] {'loss': '0.9047', 'grad_norm': '0.5018', 'learning_rate': '0.0001538', 'ppl': '2.471', 'memory/max_active (GiB)': '33.83', 'memory/max_allocated (GiB)': '33.83', 'memory/device_reserved (GiB)': '42.33', 'tokens/train_per_sec_per_gpu': '243', 'tokens/trainable': 291324, 'tokens/total': 637560, 'epoch': '0.08011'}
	8%|████████████▌ | 22/263 [17:44<3:12:51, 48.01s/it] {'loss': '0.892', 'grad_norm': '0.4374', 'learning_rate': '0.0001615', 'ppl': '2.44', 'memory/max_active (GiB)': '31.34', 'memory/max_allocated (GiB)': '31.34', 'memory/device_reserved (GiB)': '41.63', 'tokens/train_per_sec_per_gpu': '440', 'tokens/trainable': 309998, 'tokens/total': 672066, 'epoch': '0.08393'}
	9%|█████████████ | 23/263 [18:16<2:52:03, 43.01s/it] {'loss': '1', 'grad_norm': '0.3918', 'learning_rate': '0.0001692', 'ppl': '2.719', 'memory/max_active (GiB)': '27.48', 'memory/max_allocated (GiB)': '27.48', 'memory/device_reserved (GiB)': '35.84', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 322055, 'tokens/total': 698356, 'epoch': '0.08774'}
	9%|█████████████▋ | 24/263 [18:42<2:31:15, 37.97s/it] {'loss': '0.9319', 'grad_norm': '0.3207', 'learning_rate': '0.0001769', 'ppl': '2.539', 'memory/max_active (GiB)': '22.66', 'memory/max_allocated (GiB)': '22.66', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '382.2', 'tokens/trainable': 332075, 'tokens/total': 718552, 'epoch': '0.09156'}
	10%|██████████████▎ | 25/263 [19:26<2:38:17, 39.90s/it] {'loss': '0.8685', 'grad_norm': '0.2337', 'learning_rate': '0.0001846', 'ppl': '2.383', 'memory/max_active (GiB)': '31.11', 'memory/max_allocated (GiB)': '31.11', 'memory/device_reserved (GiB)': '38.38', 'tokens/train_per_sec_per_gpu': '287', 'tokens/trainable': 344822, 'tokens/total': 746628, 'epoch': '0.09537'}
	10%|██████████████▊ | 26/263 [20:17<2:50:36, 43.19s/it] {'loss': '0.9079', 'grad_norm': '0.2105', 'learning_rate': '0.0001923', 'ppl': '2.479', 'memory/max_active (GiB)': '27.1', 'memory/max_allocated (GiB)': '27.1', 'memory/device_reserved (GiB)': '35.33', 'tokens/train_per_sec_per_gpu': '241.9', 'tokens/trainable': 357129, 'tokens/total': 775900, 'epoch': '0.09919'}
	10%|███████████████▍ | 27/263 [20:59<2:48:53, 42.94s/it] {'loss': '0.9097', 'grad_norm': '0.2115', 'learning_rate': '0.0002', 'ppl': '2.484', 'memory/max_active (GiB)': '45.8', 'memory/max_allocated (GiB)': '45.8', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '297.6', 'tokens/trainable': 369730, 'tokens/total': 802730, 'epoch': '0.103'}
	11%|███████████████▉ | 28/263 [21:41<2:46:11, 42.43s/it] {'loss': '0.8107', 'grad_norm': '0.1699', 'learning_rate': '0.0002', 'ppl': '2.249', 'memory/max_active (GiB)': '39.66', 'memory/max_allocated (GiB)': '39.66', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '399.8', 'tokens/trainable': 386221, 'tokens/total': 839556, 'epoch': '0.1068'}
	11%|████████████████▌ | 29/263 [22:13<2:33:22, 39.33s/it] {'loss': '0.8618', 'grad_norm': '0.188', 'learning_rate': '0.0002', 'ppl': '2.367', 'memory/max_active (GiB)': '30.9', 'memory/max_allocated (GiB)': '30.9', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '358.7', 'tokens/trainable': 397730, 'tokens/total': 866216, 'epoch': '0.1106'}
	11%|█████████████████ | 30/263 [22:51<2:31:11, 38.93s/it] {'loss': '0.7588', 'grad_norm': '0.1495', 'learning_rate': '0.0001999', 'ppl': '2.136', 'memory/max_active (GiB)': '31.81', 'memory/max_allocated (GiB)': '31.81', 'memory/device_reserved (GiB)': '42.5', 'tokens/train_per_sec_per_gpu': '420', 'tokens/trainable': 413695, 'tokens/total': 898918, 'epoch': '0.1144'}
	11%\|█████████████████ \| 30/263 [22:51<2:31:11, 38.93s/it] 12%\|█████████████████▋ \| 31/263 [23:27<2:27:18, 38.10s/it] {'loss': '0.7096', 'grad_norm': '0.1585', 'learning_rate': '0.0001999', 'ppl': '2.033', 'memory/max_active (GiB)': '26.02', 'memory/max_allocated (GiB)': '26.02', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '283.3', 'tokens/trainable': 423936, 'tokens/total': 923272, 'epoch': '0.1183'}
	12%\|█████████████████▋ \| 31/263 [23:27<2:27:18, 38.10s/it] 12%\|██████████████████▎ \| 32/263 [24:16<2:39:47, 41.50s/it] {'loss': '0.7129', 'grad_norm': '0.1396', 'learning_rate': '0.0001998', 'ppl': '2.04', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.17', 'tokens/train_per_sec_per_gpu': '434.5', 'tokens/trainable': 445423, 'tokens/total': 968072, 'epoch': '0.1221'}
	12%\|██████████████████▎ \| 32/263 [24:16<2:39:47, 41.50s/it] 13%\|██████████████████▊ \| 33/263 [24:52<2:32:16, 39.72s/it] {'loss': '0.8433', 'grad_norm': '0.1598', 'learning_rate': '0.0001997', 'ppl': '2.324', 'memory/max_active (GiB)': '28.78', 'memory/max_allocated (GiB)': '28.78', 'memory/device_reserved (GiB)': '61.66', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 458707, 'tokens/total': 996758, 'epoch': '0.1259'}
	13%\|██████████████████▊ \| 33/263 [24:52<2:32:16, 39.72s/it] 13%\|███████████████████▍ \| 34/263 [25:44<2:46:13, 43.55s/it] {'loss': '0.7034', 'grad_norm': '0.1374', 'learning_rate': '0.0001996', 'ppl': '2.021', 'memory/max_active (GiB)': '47.51', 'memory/max_allocated (GiB)': '47.51', 'memory/device_reserved (GiB)': '67.08', 'tokens/train_per_sec_per_gpu': '380.5', 'tokens/trainable': 478680, 'tokens/total': 1041764, 'epoch': '0.1297'}
	13%\|███████████████████▍ \| 34/263 [25:44<2:46:13, 43.55s/it] 13%\|███████████████████▉ \| 35/263 [26:23<2:39:55, 42.09s/it] {'loss': '0.705', 'grad_norm': '0.1833', 'learning_rate': '0.0001994', 'ppl': '2.024', 'memory/max_active (GiB)': '35.78', 'memory/max_allocated (GiB)': '35.78', 'memory/device_reserved (GiB)': '45.08', 'tokens/train_per_sec_per_gpu': '333.5', 'tokens/trainable': 491578, 'tokens/total': 1071866, 'epoch': '0.1335'}
	13%\|███████████████████▉ \| 35/263 [26:23<2:39:55, 42.09s/it] 14%\|████████████████████▌ \| 36/263 [27:25<3:01:14, 47.90s/it] {'loss': '0.7842', 'grad_norm': '0.129', 'learning_rate': '0.0001993', 'ppl': '2.191', 'memory/max_active (GiB)': '60.96', 'memory/max_allocated (GiB)': '60.96', 'memory/device_reserved (GiB)': '68.99', 'tokens/train_per_sec_per_gpu': '468.1', 'tokens/trainable': 520355, 'tokens/total': 1124412, 'epoch': '0.1373'}
	14%\|████████████████████▌ \| 36/263 [27:25<3:01:14, 47.90s/it] 14%\|█████████████████████ \| 37/263 [27:58<2:43:52, 43.51s/it] {'loss': '0.81', 'grad_norm': '0.164', 'learning_rate': '0.0001991', 'ppl': '2.248', 'memory/max_active (GiB)': '40.05', 'memory/max_allocated (GiB)': '40.05', 'memory/device_reserved (GiB)': '55.65', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 533138, 'tokens/total': 1152862, 'epoch': '0.1412'}
	14%\|█████████████████████ \| 37/263 [27:58<2:43:52, 43.51s/it] 14%\|█████████████████████▋ \| 38/263 [28:42<2:43:52, 43.70s/it] {'loss': '0.6781', 'grad_norm': '0.1564', 'learning_rate': '0.0001989', 'ppl': '1.97', 'memory/max_active (GiB)': '49.18', 'memory/max_allocated (GiB)': '49.18', 'memory/device_reserved (GiB)': '69.5', 'tokens/train_per_sec_per_gpu': '413.4', 'tokens/trainable': 551391, 'tokens/total': 1190852, 'epoch': '0.145'}
	14%\|█████████████████████▋ \| 38/263 [28:42<2:43:52, 43.70s/it] 15%\|██████████████████████▏ \| 39/263 [29:17<2:33:07, 41.01s/it] {'loss': '0.8666', 'grad_norm': '0.1817', 'learning_rate': '0.0001987', 'ppl': '2.379', 'memory/max_active (GiB)': '42.54', 'memory/max_allocated (GiB)': '42.54', 'memory/device_reserved (GiB)': '59.24', 'tokens/train_per_sec_per_gpu': '349.9', 'tokens/trainable': 563547, 'tokens/total': 1219812, 'epoch': '0.1488'}
	15%\|██████████████████████▏ \| 39/263 [29:17<2:33:07, 41.01s/it] 15%\|██████████████████████▊ \| 40/263 [29:53<2:27:38, 39.72s/it] {'loss': '0.7456', 'grad_norm': '0.1504', 'learning_rate': '0.0001985', 'ppl': '2.108', 'memory/max_active (GiB)': '37.56', 'memory/max_allocated (GiB)': '37.56', 'memory/device_reserved (GiB)': '47.6', 'tokens/train_per_sec_per_gpu': '429.8', 'tokens/trainable': 579326, 'tokens/total': 1251662, 'epoch': '0.1526'}
	15%\|██████████████████████▊ \| 40/263 [29:53<2:27:38, 39.72s/it] 16%\|███████████████████████▍ \| 41/263 [30:30<2:23:42, 38.84s/it] {'loss': '0.7436', 'grad_norm': '0.1403', 'learning_rate': '0.0001983', 'ppl': '2.103', 'memory/max_active (GiB)': '35.11', 'memory/max_allocated (GiB)': '35.11', 'memory/device_reserved (GiB)': '44.3', 'tokens/train_per_sec_per_gpu': '410.9', 'tokens/trainable': 594442, 'tokens/total': 1285326, 'epoch': '0.1564'}
	16%\|███████████████████████▍ \| 41/263 [30:30<2:23:42, 38.84s/it] 16%\|███████████████████████▉ \| 42/263 [31:12<2:26:40, 39.82s/it] {'loss': '0.752', 'grad_norm': '0.1654', 'learning_rate': '0.000198', 'ppl': '2.121', 'memory/max_active (GiB)': '31.65', 'memory/max_allocated (GiB)': '31.65', 'memory/device_reserved (GiB)': '41.45', 'tokens/train_per_sec_per_gpu': '317.2', 'tokens/trainable': 607800, 'tokens/total': 1315504, 'epoch': '0.1602'}
	16%\|███████████████████████▉ \| 42/263 [31:12<2:26:40, 39.82s/it] 16%\|████████████████████████▌ \| 43/263 [31:54<2:28:01, 40.37s/it] {'loss': '0.823', 'grad_norm': '0.1637', 'learning_rate': '0.0001978', 'ppl': '2.277', 'memory/max_active (GiB)': '26.15', 'memory/max_allocated (GiB)': '26.15', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '277.6', 'tokens/trainable': 619363, 'tokens/total': 1338730, 'epoch': '0.164'}
	16%\|████████████████████████▌ \| 43/263 [31:54<2:28:01, 40.37s/it] 17%\|█████████████████████████ \| 44/263 [32:34<2:26:37, 40.17s/it] {'loss': '0.7822', 'grad_norm': '0.1594', 'learning_rate': '0.0001975', 'ppl': '2.186', 'memory/max_active (GiB)': '36.38', 'memory/max_allocated (GiB)': '36.38', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '375.6', 'tokens/trainable': 634277, 'tokens/total': 1370806, 'epoch': '0.1679'}
	17%\|█████████████████████████ \| 44/263 [32:34<2:26:37, 40.17s/it] 17%\|█████████████████████████▋ \| 45/263 [33:11<2:23:02, 39.37s/it] {'loss': '0.7665', 'grad_norm': '0.1315', 'learning_rate': '0.0001972', 'ppl': '2.152', 'memory/max_active (GiB)': '32.6', 'memory/max_allocated (GiB)': '32.6', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '406.5', 'tokens/trainable': 649519, 'tokens/total': 1403632, 'epoch': '0.1717'}
	17%\|█████████████████████████▋ \| 45/263 [33:11<2:23:02, 39.37s/it] 17%\|██████████████████████████▏ \| 46/263 [33:41<2:12:01, 36.50s/it] {'loss': '0.7592', 'grad_norm': '0.1396', 'learning_rate': '0.0001968', 'ppl': '2.137', 'memory/max_active (GiB)': '22.71', 'memory/max_allocated (GiB)': '22.71', 'memory/device_reserved (GiB)': '33.91', 'tokens/train_per_sec_per_gpu': '378.9', 'tokens/trainable': 660815, 'tokens/total': 1426008, 'epoch': '0.1755'}
	17%\|██████████████████████████▏ \| 46/263 [33:41<2:12:01, 36.50s/it] 18%\|██████████████████████████▊ \| 47/263 [34:20<2:14:37, 37.40s/it] {'loss': '0.7492', 'grad_norm': '0.1523', 'learning_rate': '0.0001965', 'ppl': '2.115', 'memory/max_active (GiB)': '30.88', 'memory/max_allocated (GiB)': '30.88', 'memory/device_reserved (GiB)': '40.8', 'tokens/train_per_sec_per_gpu': '300.6', 'tokens/trainable': 672685, 'tokens/total': 1452446, 'epoch': '0.1793'}
	18%\|██████████████████████████▊ \| 47/263 [34:20<2:14:37, 37.40s/it] 18%\|███████████████████████████▍ \| 48/263 [35:11<2:28:02, 41.32s/it] {'loss': '0.7758', 'grad_norm': '0.1287', 'learning_rate': '0.0001962', 'ppl': '2.172', 'memory/max_active (GiB)': '46.66', 'memory/max_allocated (GiB)': '46.66', 'memory/device_reserved (GiB)': '65.45', 'tokens/train_per_sec_per_gpu': '309.9', 'tokens/trainable': 688320, 'tokens/total': 1484130, 'epoch': '0.1831'}
	18%\|███████████████████████████▍ \| 48/263 [35:11<2:28:02, 41.32s/it] 19%\|███████████████████████████▉ \| 49/263 [35:50<2:24:35, 40.54s/it] {'loss': '0.723', 'grad_norm': '0.1277', 'learning_rate': '0.0001958', 'ppl': '2.061', 'memory/max_active (GiB)': '31.37', 'memory/max_allocated (GiB)': '31.37', 'memory/device_reserved (GiB)': '41.67', 'tokens/train_per_sec_per_gpu': '351.3', 'tokens/trainable': 701926, 'tokens/total': 1512564, 'epoch': '0.1869'}
	19%\|███████████████████████████▉ \| 49/263 [35:50<2:24:35, 40.54s/it] 19%\|████████████████████████████▌ \| 50/263 [36:34<2:27:53, 41.66s/it] {'loss': '0.6734', 'grad_norm': '0.1139', 'learning_rate': '0.0001954', 'ppl': '1.961', 'memory/max_active (GiB)': '38.79', 'memory/max_allocated (GiB)': '38.79', 'memory/device_reserved (GiB)': '53.5', 'tokens/train_per_sec_per_gpu': '458.3', 'tokens/trainable': 722216, 'tokens/total': 1551038, 'epoch': '0.1907'}
	19%\|████████████████████████████▌ \| 50/263 [36:34<2:27:53, 41.66s/it] 19%\|█████████████████████████████ \| 51/263 [37:28<2:40:22, 45.39s/it] {'loss': '0.7031', 'grad_norm': '0.1693', 'learning_rate': '0.000195', 'ppl': '2.02', 'memory/max_active (GiB)': '45.1', 'memory/max_allocated (GiB)': '45.1', 'memory/device_reserved (GiB)': '63.62', 'tokens/train_per_sec_per_gpu': '270.4', 'tokens/trainable': 736843, 'tokens/total': 1585540, 'epoch': '0.1946'}
	19%\|█████████████████████████████ \| 51/263 [37:28<2:40:22, 45.39s/it] 20%\|█████████████████████████████▋ \| 52/263 [38:10<2:36:32, 44.52s/it] {'loss': '0.7569', 'grad_norm': '0.1728', 'learning_rate': '0.0001946', 'ppl': '2.132', 'memory/max_active (GiB)': '20.75', 'memory/max_allocated (GiB)': '20.75', 'memory/device_reserved (GiB)': '25.67', 'tokens/train_per_sec_per_gpu': '179.7', 'tokens/trainable': 744474, 'tokens/total': 1602602, 'epoch': '0.1984'}
	20%\|█████████████████████████████▋ \| 52/263 [38:10<2:36:32, 44.52s/it] 20%\|██████████████████████████████▏ \| 53/263 [38:53<2:33:22, 43.82s/it] {'loss': '0.81', 'grad_norm': '0.1633', 'learning_rate': '0.0001941', 'ppl': '2.248', 'memory/max_active (GiB)': '27.01', 'memory/max_allocated (GiB)': '27.01', 'memory/device_reserved (GiB)': '35.24', 'tokens/train_per_sec_per_gpu': '223.7', 'tokens/trainable': 753915, 'tokens/total': 1625426, 'epoch': '0.2022'}
	20%\|██████████████████████████████▏ \| 53/263 [38:53<2:33:22, 43.82s/it] 21%\|██████████████████████████████▊ \| 54/263 [39:42<2:37:58, 45.35s/it] {'loss': '0.7126', 'grad_norm': '0.1385', 'learning_rate': '0.0001937', 'ppl': '2.039', 'memory/max_active (GiB)': '37.95', 'memory/max_allocated (GiB)': '37.95', 'memory/device_reserved (GiB)': '48.03', 'tokens/train_per_sec_per_gpu': '320.1', 'tokens/trainable': 769572, 'tokens/total': 1660362, 'epoch': '0.206'}
	21%\|██████████████████████████████▊ \| 54/263 [39:42<2:37:58, 45.35s/it] 21%\|███████████████████████████████▎ \| 55/263 [40:31<2:41:57, 46.72s/it] {'loss': '0.6959', 'grad_norm': '0.1312', 'learning_rate': '0.0001932', 'ppl': '2.006', 'memory/max_active (GiB)': '45.85', 'memory/max_allocated (GiB)': '45.85', 'memory/device_reserved (GiB)': '64.3', 'tokens/train_per_sec_per_gpu': '344.5', 'tokens/trainable': 786763, 'tokens/total': 1697704, 'epoch': '0.2098'}
	21%\|███████████████████████████████▎ \| 55/263 [40:31<2:41:57, 46.72s/it] 21%\|███████████████████████████████▉ \| 56/263 [41:15<2:37:46, 45.73s/it] {'loss': '0.7052', 'grad_norm': '0.1675', 'learning_rate': '0.0001927', 'ppl': '2.024', 'memory/max_active (GiB)': '28.46', 'memory/max_allocated (GiB)': '28.46', 'memory/device_reserved (GiB)': '37.35', 'tokens/train_per_sec_per_gpu': '235.5', 'tokens/trainable': 796991, 'tokens/total': 1720524, 'epoch': '0.2136'}
	21%\|███████████████████████████████▉ \| 56/263 [41:15<2:37:46, 45.73s/it] 22%\|████████████████████████████████▌ \| 57/263 [42:15<2:51:52, 50.06s/it] {'loss': '0.6921', 'grad_norm': '0.1401', 'learning_rate': '0.0001922', 'ppl': '1.998', 'memory/max_active (GiB)': '54.96', 'memory/max_allocated (GiB)': '54.96', 'memory/device_reserved (GiB)': '78.17', 'tokens/train_per_sec_per_gpu': '296.9', 'tokens/trainable': 814853, 'tokens/total': 1756640, 'epoch': '0.2175'}
	22%\|████████████████████████████████▌ \| 57/263 [42:15<2:51:52, 50.06s/it] 22%\|█████████████████████████████████ \| 58/263 [43:00<2:45:34, 48.46s/it] {'loss': '0.7537', 'grad_norm': '0.1416', 'learning_rate': '0.0001917', 'ppl': '2.125', 'memory/max_active (GiB)': '32.83', 'memory/max_allocated (GiB)': '32.83', 'memory/device_reserved (GiB)': '41.12', 'tokens/train_per_sec_per_gpu': '332.1', 'tokens/trainable': 829705, 'tokens/total': 1793570, 'epoch': '0.2213'}
	22%\|█████████████████████████████████ \| 58/263 [43:00<2:45:34, 48.46s/it] 22%\|█████████████████████████████████▋ \| 59/263 [43:42<2:38:03, 46.49s/it] {'loss': '0.731', 'grad_norm': '0.1516', 'learning_rate': '0.0001911', 'ppl': '2.077', 'memory/max_active (GiB)': '33.92', 'memory/max_allocated (GiB)': '33.92', 'memory/device_reserved (GiB)': '43.25', 'tokens/train_per_sec_per_gpu': '357.5', 'tokens/trainable': 844681, 'tokens/total': 1827586, 'epoch': '0.2251'}
	22%\|█████████████████████████████████▋ \| 59/263 [43:42<2:38:03, 46.49s/it] 23%\|██████████████████████████████████▏ \| 60/263 [44:31<2:40:35, 47.46s/it] {'loss': '0.7322', 'grad_norm': '0.1597', 'learning_rate': '0.0001906', 'ppl': '2.08', 'memory/max_active (GiB)': '27.46', 'memory/max_allocated (GiB)': '27.46', 'memory/device_reserved (GiB)': '42.7', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 855924, 'tokens/total': 1853648, 'epoch': '0.2289'}
	23%\|██████████████████████████████████▏ \| 60/263 [44:31<2:40:35, 47.46s/it] 23%\|██████████████████████████████████▊ \| 61/263 [45:19<2:39:37, 47.41s/it] {'loss': '0.7511', 'grad_norm': '0.1819', 'learning_rate': '0.00019', 'ppl': '2.119', 'memory/max_active (GiB)': '27.97', 'memory/max_allocated (GiB)': '27.97', 'memory/device_reserved (GiB)': '36.56', 'tokens/train_per_sec_per_gpu': '225.6', 'tokens/trainable': 866595, 'tokens/total': 1878090, 'epoch': '0.2327'}
	23%\|██████████████████████████████████▊ \| 61/263 [45:19<2:39:37, 47.41s/it] 24%\|███████████████████████████████████▎ \| 62/263 [46:01<2:33:28, 45.81s/it] {'loss': '0.6627', 'grad_norm': '0.1462', 'learning_rate': '0.0001894', 'ppl': '1.94', 'memory/max_active (GiB)': '38.92', 'memory/max_allocated (GiB)': '38.92', 'memory/device_reserved (GiB)': '53.8', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 882309, 'tokens/total': 1913740, 'epoch': '0.2365'}
	24%\|███████████████████████████████████▎ \| 62/263 [46:01<2:33:28, 45.81s/it] 24%\|███████████████████████████████████▉ \| 63/263 [47:16<3:01:53, 54.57s/it] {'loss': '0.6414', 'grad_norm': '0.1468', 'learning_rate': '0.0001888', 'ppl': '1.899', 'memory/max_active (GiB)': '56.13', 'memory/max_allocated (GiB)': '56.13', 'memory/device_reserved (GiB)': '73.46', 'tokens/train_per_sec_per_gpu': '332.8', 'tokens/trainable': 907269, 'tokens/total': 1971906, 'epoch': '0.2403'}
	24%\|███████████████████████████████████▉ \| 63/263 [47:16<3:01:53, 54.57s/it] 24%\|████████████████████████████████████▌ \| 64/263 [47:56<2:46:47, 50.29s/it] {'loss': '0.7153', 'grad_norm': '0.2024', 'learning_rate': '0.0001882', 'ppl': '2.045', 'memory/max_active (GiB)': '53.82', 'memory/max_allocated (GiB)': '53.82', 'memory/device_reserved (GiB)': '76.49', 'tokens/train_per_sec_per_gpu': '322', 'tokens/trainable': 920250, 'tokens/total': 2002390, 'epoch': '0.2442'}
	24%\|████████████████████████████████████▌ \| 64/263 [47:56<2:46:47, 50.29s/it] 25%\|█████████████████████████████████████ \| 65/263 [48:30<2:29:57, 45.44s/it] {'loss': '0.7389', 'grad_norm': '0.1698', 'learning_rate': '0.0001876', 'ppl': '2.094', 'memory/max_active (GiB)': '29.14', 'memory/max_allocated (GiB)': '29.14', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '324.5', 'tokens/trainable': 931325, 'tokens/total': 2028270, 'epoch': '0.248'}
	25%\|█████████████████████████████████████ \| 65/263 [48:30<2:29:57, 45.44s/it] 25%\|█████████████████████████████████████▋ \| 66/263 [49:17<2:30:15, 45.76s/it] {'loss': '0.8371', 'grad_norm': '0.2279', 'learning_rate': '0.0001869', 'ppl': '2.31', 'memory/max_active (GiB)': '33.75', 'memory/max_allocated (GiB)': '33.75', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '214.5', 'tokens/trainable': 941301, 'tokens/total': 2051108, 'epoch': '0.2518'}
	25%\|█████████████████████████████████████▋ \| 66/263 [49:17<2:30:15, 45.76s/it] 25%\|██████████████████████████████████████▏ \| 67/263 [50:05<2:31:52, 46.49s/it] {'loss': '0.705', 'grad_norm': '0.1503', 'learning_rate': '0.0001863', 'ppl': '2.024', 'memory/max_active (GiB)': '27.18', 'memory/max_allocated (GiB)': '27.18', 'memory/device_reserved (GiB)': '35.22', 'tokens/train_per_sec_per_gpu': '279', 'tokens/trainable': 954748, 'tokens/total': 2079644, 'epoch': '0.2556'}
	25%\|██████████████████████████████████████▏ \| 67/263 [50:05<2:31:52, 46.49s/it] 26%\|██████████████████████████████████████▊ \| 68/263 [50:43<2:23:13, 44.07s/it] {'loss': '0.7131', 'grad_norm': '0.1581', 'learning_rate': '0.0001856', 'ppl': '2.04', 'memory/max_active (GiB)': '34.76', 'memory/max_allocated (GiB)': '34.76', 'memory/device_reserved (GiB)': '43.81', 'tokens/train_per_sec_per_gpu': '356.9', 'tokens/trainable': 968461, 'tokens/total': 2106514, 'epoch': '0.2594'}
	26%\|██████████████████████████████████████▊ \| 68/263 [50:43<2:23:13, 44.07s/it] 26%\|███████████████████████████████████████▎ \| 69/263 [51:16<2:11:22, 40.63s/it] {'loss': '0.7511', 'grad_norm': '0.1736', 'learning_rate': '0.0001849', 'ppl': '2.119', 'memory/max_active (GiB)': '27.61', 'memory/max_allocated (GiB)': '27.61', 'memory/device_reserved (GiB)': '36.19', 'tokens/train_per_sec_per_gpu': '401.6', 'tokens/trainable': 981554, 'tokens/total': 2133242, 'epoch': '0.2632'}
	26%\|███████████████████████████████████████▎ \| 69/263 [51:16<2:11:22, 40.63s/it] 27%\|███████████████████████████████████████▉ \| 70/263 [51:47<2:00:59, 37.61s/it] {'loss': '0.7844', 'grad_norm': '0.1974', 'learning_rate': '0.0001842', 'ppl': '2.191', 'memory/max_active (GiB)': '22.87', 'memory/max_allocated (GiB)': '22.87', 'memory/device_reserved (GiB)': '35.35', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 990994, 'tokens/total': 2153354, 'epoch': '0.267'}
	27%\|███████████████████████████████████████▉ \| 70/263 [51:47<2:00:59, 37.61s/it] 27%\|████████████████████████████████████████▍ \| 71/263 [52:43<2:18:20, 43.23s/it] {'loss': '0.6828', 'grad_norm': '0.2258', 'learning_rate': '0.0001835', 'ppl': '1.98', 'memory/max_active (GiB)': '39.9', 'memory/max_allocated (GiB)': '39.9', 'memory/device_reserved (GiB)': '55.24', 'tokens/train_per_sec_per_gpu': '225.3', 'tokens/trainable': 1003686, 'tokens/total': 2182806, 'epoch': '0.2709'}
	27%\|████████████████████████████████████████▍ \| 71/263 [52:43<2:18:20, 43.23s/it] 27%\|█████████████████████████████████████████ \| 72/263 [54:06<2:55:59, 55.29s/it] {'loss': '0.5772', 'grad_norm': '0.1597', 'learning_rate': '0.0001827', 'ppl': '1.781', 'memory/max_active (GiB)': '59.04', 'memory/max_allocated (GiB)': '59.04', 'memory/device_reserved (GiB)': '76.68', 'tokens/train_per_sec_per_gpu': '300.7', 'tokens/trainable': 1028766, 'tokens/total': 2241490, 'epoch': '0.2747'}
	27%\|█████████████████████████████████████████ \| 72/263 [54:06<2:55:59, 55.29s/it] 28%\|█████████████████████████████████████████▋ \| 73/263 [54:46<2:39:57, 50.51s/it] {'loss': '0.6909', 'grad_norm': '0.1693', 'learning_rate': '0.000182', 'ppl': '1.995', 'memory/max_active (GiB)': '42.9', 'memory/max_allocated (GiB)': '42.9', 'memory/device_reserved (GiB)': '59.81', 'tokens/train_per_sec_per_gpu': '349.6', 'tokens/trainable': 1042530, 'tokens/total': 2272086, 'epoch': '0.2785'}
	28%\|█████████████████████████████████████████▋ \| 73/263 [54:46<2:39:57, 50.51s/it] 28%\|██████████████████████████████████████████▏ \| 74/263 [55:30<2:33:13, 48.65s/it] {'loss': '0.7108', 'grad_norm': '0.1959', 'learning_rate': '0.0001812', 'ppl': '2.036', 'memory/max_active (GiB)': '28.32', 'memory/max_allocated (GiB)': '28.32', 'memory/device_reserved (GiB)': '37.09', 'tokens/train_per_sec_per_gpu': '224', 'tokens/trainable': 1052450, 'tokens/total': 2294892, 'epoch': '0.2823'}
	28%\|██████████████████████████████████████████▏ \| 74/263 [55:30<2:33:13, 48.65s/it] 29%\|██████████████████████████████████████████▊ \| 75/263 [56:22<2:35:57, 49.78s/it] {'loss': '0.7308', 'grad_norm': '0.1824', 'learning_rate': '0.0001804', 'ppl': '2.077', 'memory/max_active (GiB)': '25.12', 'memory/max_allocated (GiB)': '25.12', 'memory/device_reserved (GiB)': '32.62', 'tokens/train_per_sec_per_gpu': '252.4', 'tokens/trainable': 1065680, 'tokens/total': 2320714, 'epoch': '0.2861'}
	29%\|██████████████████████████████████████████▊ \| 75/263 [56:22<2:35:57, 49.78s/it] 29%\|███████████████████████████████████████████▎ \| 76/263 [57:06<2:29:33, 47.99s/it] {'loss': '0.6717', 'grad_norm': '0.1747', 'learning_rate': '0.0001796', 'ppl': '1.958', 'memory/max_active (GiB)': '23.01', 'memory/max_allocated (GiB)': '23.01', 'memory/device_reserved (GiB)': '29.15', 'tokens/train_per_sec_per_gpu': '217.3', 'tokens/trainable': 1075199, 'tokens/total': 2342076, 'epoch': '0.2899'}
	29%\|███████████████████████████████████████████▎ \| 76/263 [57:06<2:29:33, 47.99s/it] 29%\|███████████████████████████████████████████▉ \| 77/263 [57:38<2:13:45, 43.15s/it] {'loss': '0.7283', 'grad_norm': '0.1765', 'learning_rate': '0.0001788', 'ppl': '2.071', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '353.1', 'tokens/trainable': 1086448, 'tokens/total': 2367826, 'epoch': '0.2938'}
	29%\|███████████████████████████████████████████▉ \| 77/263 [57:38<2:13:45, 43.15s/it] 30%\|████████████████████████████████████████████▍ \| 78/263 [58:24<2:15:13, 43.86s/it] {'loss': '0.6897', 'grad_norm': '0.133', 'learning_rate': '0.000178', 'ppl': '1.993', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.97', 'tokens/train_per_sec_per_gpu': '393.6', 'tokens/trainable': 1104361, 'tokens/total': 2406764, 'epoch': '0.2976'}
	30%\|████████████████████████████████████████████▍ \| 78/263 [58:24<2:15:13, 43.86s/it] 30%\|█████████████████████████████████████████████ \| 79/263 [59:16<2:22:50, 46.58s/it] {'loss': '0.726', 'grad_norm': '0.1792', 'learning_rate': '0.0001772', 'ppl': '2.067', 'memory/max_active (GiB)': '40.91', 'memory/max_allocated (GiB)': '40.91', 'memory/device_reserved (GiB)': '56.76', 'tokens/train_per_sec_per_gpu': '245.3', 'tokens/trainable': 1117342, 'tokens/total': 2434766, 'epoch': '0.3014'}
	30%\|█████████████████████████████████████████████ \| 79/263 [59:16<2:22:50, 46.58s/it] 30%\|█████████████████████████████████████████████ \| 80/263 [1:00:12<2:30:27, 49.33s/it] {'loss': '0.7226', 'grad_norm': '0.1773', 'learning_rate': '0.0001763', 'ppl': '2.06', 'memory/max_active (GiB)': '24.17', 'memory/max_allocated (GiB)': '24.17', 'memory/device_reserved (GiB)': '30.76', 'tokens/train_per_sec_per_gpu': '201.2', 'tokens/trainable': 1128559, 'tokens/total': 2459278, 'epoch': '0.3052'}
	30%\|█████████████████████████████████████████████ \| 80/263 [1:00:12<2:30:27, 49.33s/it] 31%\|█████████████████████████████████████████████▌ \| 81/263 [1:00:56<2:24:41, 47.70s/it] {'loss': '0.687', 'grad_norm': '0.2009', 'learning_rate': '0.0001755', 'ppl': '1.988', 'memory/max_active (GiB)': '25.85', 'memory/max_allocated (GiB)': '25.85', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '252', 'tokens/trainable': 1139622, 'tokens/total': 2482288, 'epoch': '0.309'}
	31%\|█████████████████████████████████████████████▌ \| 81/263 [1:00:56<2:24:41, 47.70s/it] 31%\|██████████████████████████████████████████████▏ \| 82/263 [1:01:30<2:11:37, 43.64s/it] {'loss': '0.6404', 'grad_norm': '0.1794', 'learning_rate': '0.0001746', 'ppl': '1.897', 'memory/max_active (GiB)': '34.83', 'memory/max_allocated (GiB)': '34.83', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '332.6', 'tokens/trainable': 1150978, 'tokens/total': 2508102, 'epoch': '0.3128'}
	31%\|██████████████████████████████████████████████▏ \| 82/263 [1:01:30<2:11:37, 43.64s/it] 32%\|██████████████████████████████████████████████▋ \| 83/263 [1:02:17<2:13:57, 44.65s/it] {'loss': '0.6434', 'grad_norm': '0.1725', 'learning_rate': '0.0001737', 'ppl': '1.903', 'memory/max_active (GiB)': '34.96', 'memory/max_allocated (GiB)': '34.96', 'memory/device_reserved (GiB)': '43.95', 'tokens/train_per_sec_per_gpu': '264.2', 'tokens/trainable': 1163404, 'tokens/total': 2534552, 'epoch': '0.3166'}
	32%\|██████████████████████████████████████████████▋ \| 83/263 [1:02:17<2:13:57, 44.65s/it] 32%\|███████████████████████████████████████████████▎ \| 84/263 [1:03:06<2:17:12, 45.99s/it] {'loss': '0.6432', 'grad_norm': '0.144', 'learning_rate': '0.0001728', 'ppl': '1.903', 'memory/max_active (GiB)': '24.85', 'memory/max_allocated (GiB)': '24.85', 'memory/device_reserved (GiB)': '32.01', 'tokens/train_per_sec_per_gpu': '300', 'tokens/trainable': 1178135, 'tokens/total': 2562982, 'epoch': '0.3205'}
	32%\|███████████████████████████████████████████████▎ \| 84/263 [1:03:06<2:17:12, 45.99s/it] 32%\|███████████████████████████████████████████████▊ \| 85/263 [1:03:53<2:16:37, 46.05s/it] {'loss': '0.7349', 'grad_norm': '0.2154', 'learning_rate': '0.0001719', 'ppl': '2.085', 'memory/max_active (GiB)': '25.97', 'memory/max_allocated (GiB)': '25.97', 'memory/device_reserved (GiB)': '33.5', 'tokens/train_per_sec_per_gpu': '201.4', 'tokens/trainable': 1187438, 'tokens/total': 2585130, 'epoch': '0.3243'}
	32%\|███████████████████████████████████████████████▊ \| 85/263 [1:03:53<2:16:37, 46.05s/it] 33%\|████████████████████████████████████████████████▍ \| 86/263 [1:04:34<2:11:44, 44.66s/it] {'loss': '0.667', 'grad_norm': '0.1647', 'learning_rate': '0.0001709', 'ppl': '1.948', 'memory/max_active (GiB)': '32.35', 'memory/max_allocated (GiB)': '32.35', 'memory/device_reserved (GiB)': '35.79', 'tokens/train_per_sec_per_gpu': '308.1', 'tokens/trainable': 1200197, 'tokens/total': 2610804, 'epoch': '0.3281'}
	33%\|████████████████████████████████████████████████▍ \| 86/263 [1:04:34<2:11:44, 44.66s/it] 33%\|████████████████████████████████████████████████▉ \| 87/263 [1:05:11<2:03:50, 42.22s/it] {'loss': '0.7045', 'grad_norm': '0.1813', 'learning_rate': '0.00017', 'ppl': '2.023', 'memory/max_active (GiB)': '34.17', 'memory/max_allocated (GiB)': '34.17', 'memory/device_reserved (GiB)': '42.85', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 1211474, 'tokens/total': 2635758, 'epoch': '0.3319'}
	33%\|████████████████████████████████████████████████▉ \| 87/263 [1:05:11<2:03:50, 42.22s/it] 33%\|█████████████████████████████████████████████████▌ \| 88/263 [1:06:03<2:12:03, 45.28s/it] {'loss': '0.7118', 'grad_norm': '0.1373', 'learning_rate': '0.0001691', 'ppl': '2.038', 'memory/max_active (GiB)': '49.54', 'memory/max_allocated (GiB)': '49.54', 'memory/device_reserved (GiB)': '69.87', 'tokens/train_per_sec_per_gpu': '390', 'tokens/trainable': 1231915, 'tokens/total': 2679352, 'epoch': '0.3357'}
	33%\|█████████████████████████████████████████████████▌ \| 88/263 [1:06:03<2:12:03, 45.28s/it] 34%\|██████████████████████████████████████████████████ \| 89/263 [1:06:36<2:00:21, 41.50s/it] {'loss': '0.7212', 'grad_norm': '0.1959', 'learning_rate': '0.0001681', 'ppl': '2.057', 'memory/max_active (GiB)': '26.55', 'memory/max_allocated (GiB)': '26.55', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '346.5', 'tokens/trainable': 1243243, 'tokens/total': 2703340, 'epoch': '0.3395'}
	34%\|██████████████████████████████████████████████████ \| 89/263 [1:06:36<2:00:21, 41.50s/it] 34%\|██████████████████████████████████████████████████▋ \| 90/263 [1:07:06<1:50:05, 38.18s/it] {'loss': '0.7258', 'grad_norm': '0.2195', 'learning_rate': '0.0001671', 'ppl': '2.066', 'memory/max_active (GiB)': '30.13', 'memory/max_allocated (GiB)': '30.13', 'memory/device_reserved (GiB)': '39.89', 'tokens/train_per_sec_per_gpu': '316.4', 'tokens/trainable': 1252869, 'tokens/total': 2726904, 'epoch': '0.3433'}
	34%\|██████████████████████████████████████████████████▋ \| 90/263 [1:07:06<1:50:05, 38.18s/it] 35%\|███████████████████████████████████████████████████▏ \| 91/263 [1:07:50<1:54:23, 39.90s/it] {'loss': '0.7356', 'grad_norm': '0.1998', 'learning_rate': '0.0001661', 'ppl': '2.087', 'memory/max_active (GiB)': '36.96', 'memory/max_allocated (GiB)': '36.96', 'memory/device_reserved (GiB)': '46.66', 'tokens/train_per_sec_per_gpu': '350.6', 'tokens/trainable': 1268267, 'tokens/total': 2760058, 'epoch': '0.3472'}
	35%\|███████████████████████████████████████████████████▏ \| 91/263 [1:07:50<1:54:23, 39.90s/it] 35%\|███████████████████████████████████████████████████▊ \| 92/263 [1:08:35<1:57:53, 41.37s/it] {'loss': '0.6534', 'grad_norm': '0.1845', 'learning_rate': '0.0001651', 'ppl': '1.922', 'memory/max_active (GiB)': '35.55', 'memory/max_allocated (GiB)': '35.55', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '339.5', 'tokens/trainable': 1283469, 'tokens/total': 2794120, 'epoch': '0.351'}
	35%\|███████████████████████████████████████████████████▊ \| 92/263 [1:08:35<1:57:53, 41.37s/it] 35%\|████████████████████████████████████████████████████▎ \| 93/263 [1:09:07<1:49:00, 38.47s/it] {'loss': '0.7409', 'grad_norm': '0.1926', 'learning_rate': '0.0001641', 'ppl': '2.098', 'memory/max_active (GiB)': '23.77', 'memory/max_allocated (GiB)': '23.77', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '334.8', 'tokens/trainable': 1294090, 'tokens/total': 2818588, 'epoch': '0.3548'}
	35%\|████████████████████████████████████████████████████▎ \| 93/263 [1:09:07<1:49:00, 38.47s/it] 36%\|████████████████████████████████████████████████████▉ \| 94/263 [1:09:49<1:51:37, 39.63s/it] {'loss': '0.6379', 'grad_norm': '0.1578', 'learning_rate': '0.0001631', 'ppl': '1.892', 'memory/max_active (GiB)': '34.71', 'memory/max_allocated (GiB)': '34.71', 'memory/device_reserved (GiB)': '43.64', 'tokens/train_per_sec_per_gpu': '369.7', 'tokens/trainable': 1309742, 'tokens/total': 2851686, 'epoch': '0.3586'}
	36%\|████████████████████████████████████████████████████▉ \| 94/263 [1:09:49<1:51:37, 39.63s/it] 36%\|█████████████████████████████████████████████████████▍ \| 95/263 [1:10:26<1:49:15, 39.02s/it] {'loss': '0.7076', 'grad_norm': '0.1786', 'learning_rate': '0.0001621', 'ppl': '2.029', 'memory/max_active (GiB)': '33.53', 'memory/max_allocated (GiB)': '33.53', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '353.4', 'tokens/trainable': 1323027, 'tokens/total': 2880764, 'epoch': '0.3624'}
	36%\|█████████████████████████████████████████████████████▍ \| 95/263 [1:10:26<1:49:15, 39.02s/it] 37%\|██████████████████████████████████████████████████████ \| 96/263 [1:10:59<1:42:48, 36.94s/it] {'loss': '0.6912', 'grad_norm': '0.1893', 'learning_rate': '0.000161', 'ppl': '1.996', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '44.59', 'tokens/train_per_sec_per_gpu': '352.2', 'tokens/trainable': 1334326, 'tokens/total': 2905950, 'epoch': '0.3662'}
	37%\|██████████████████████████████████████████████████████ \| 96/263 [1:10:59<1:42:48, 36.94s/it] 37%\|██████████████████████████████████████████████████████▌ \| 97/263 [1:11:43<1:48:20, 39.16s/it] {'loss': '0.6708', 'grad_norm': '0.173', 'learning_rate': '0.00016', 'ppl': '1.956', 'memory/max_active (GiB)': '47.2', 'memory/max_allocated (GiB)': '47.2', 'memory/device_reserved (GiB)': '66.49', 'tokens/train_per_sec_per_gpu': '351.1', 'tokens/trainable': 1349894, 'tokens/total': 2942380, 'epoch': '0.3701'}
	37%\|██████████████████████████████████████████████████████▌ \| 97/263 [1:11:43<1:48:20, 39.16s/it] 37%\|███████████████████████████████████████████████████████▏ \| 98/263 [1:12:38<2:00:32, 43.83s/it] {'loss': '0.685', 'grad_norm': '0.1576', 'learning_rate': '0.0001589', 'ppl': '1.984', 'memory/max_active (GiB)': '37.19', 'memory/max_allocated (GiB)': '37.19', 'memory/device_reserved (GiB)': '47.07', 'tokens/train_per_sec_per_gpu': '357.9', 'tokens/trainable': 1369481, 'tokens/total': 2988904, 'epoch': '0.3739'}
	37%\|███████████████████████████████████████████████████████▏ \| 98/263 [1:12:38<2:00:32, 43.83s/it] 38%\|███████████████████████████████████████████████████████▋ \| 99/263 [1:13:21<1:59:39, 43.77s/it] {'loss': '0.6715', 'grad_norm': '0.1494', 'learning_rate': '0.0001578', 'ppl': '1.957', 'memory/max_active (GiB)': '42.43', 'memory/max_allocated (GiB)': '42.43', 'memory/device_reserved (GiB)': '59.1', 'tokens/train_per_sec_per_gpu': '384.8', 'tokens/trainable': 1386274, 'tokens/total': 3026814, 'epoch': '0.3777'}
	38%\|███████████████████████████████████████████████████████▋ \| 99/263 [1:13:21<1:59:39, 43.77s/it] 38%\|███████████████████████████████████████████████████████▉ \| 100/263 [1:13:55<1:51:09, 40.92s/it] {'loss': '0.6481', 'grad_norm': '0.208', 'learning_rate': '0.0001567', 'ppl': '1.912', 'memory/max_active (GiB)': '26.51', 'memory/max_allocated (GiB)': '26.51', 'memory/device_reserved (GiB)': '34.38', 'tokens/train_per_sec_per_gpu': '296.5', 'tokens/trainable': 1396430, 'tokens/total': 3050820, 'epoch': '0.3815'}
	38%\|███████████████████████████████████████████████████████▉ \| 100/263 [1:13:55<1:51:09, 40.92s/it] 38%\|████████████████████████████████████████████████████████▍ \| 101/263 [1:14:53<2:04:04, 45.96s/it] {'loss': '0.7205', 'grad_norm': '0.1546', 'learning_rate': '0.0001556', 'ppl': '2.056', 'memory/max_active (GiB)': '44.95', 'memory/max_allocated (GiB)': '44.95', 'memory/device_reserved (GiB)': '62.88', 'tokens/train_per_sec_per_gpu': '350', 'tokens/trainable': 1416631, 'tokens/total': 3099000, 'epoch': '0.3853'}
	38%\|████████████████████████████████████████████████████████▍ \| 101/263 [1:14:53<2:04:04, 45.96s/it] 39%\|█████████████████████████████████████████████████████████ \| 102/263 [1:15:38<2:02:20, 45.60s/it] {'loss': '0.7284', 'grad_norm': '0.1686', 'learning_rate': '0.0001545', 'ppl': '2.072', 'memory/max_active (GiB)': '32', 'memory/max_allocated (GiB)': '32', 'memory/device_reserved (GiB)': '42.76', 'tokens/train_per_sec_per_gpu': '367.9', 'tokens/trainable': 1433096, 'tokens/total': 3134960, 'epoch': '0.3891'}
	39%\|█████████████████████████████████████████████████████████ \| 102/263 [1:15:38<2:02:20, 45.60s/it] 39%\|█████████████████████████████████████████████████████████▌ \| 103/263 [1:16:06<1:47:33, 40.33s/it] {'loss': '0.6715', 'grad_norm': '0.1936', 'learning_rate': '0.0001534', 'ppl': '1.957', 'memory/max_active (GiB)': '26.34', 'memory/max_allocated (GiB)': '26.34', 'memory/device_reserved (GiB)': '34.07', 'tokens/train_per_sec_per_gpu': '351.9', 'tokens/trainable': 1442967, 'tokens/total': 3156628, 'epoch': '0.3929'}
	39%\|█████████████████████████████████████████████████████████▌ \| 103/263 [1:16:06<1:47:33, 40.33s/it] 40%\|██████████████████████████████████████████████████████████▏ \| 104/263 [1:16:41<1:42:26, 38.66s/it] {'loss': '0.5962', 'grad_norm': '0.1888', 'learning_rate': '0.0001523', 'ppl': '1.815', 'memory/max_active (GiB)': '28.66', 'memory/max_allocated (GiB)': '28.66', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 1453178, 'tokens/total': 3180850, 'epoch': '0.3968'}
	40%\|██████████████████████████████████████████████████████████▏ \| 104/263 [1:16:41<1:42:26, 38.66s/it] 40%\|██████████████████████████████████████████████████████████▋ \| 105/263 [1:17:11<1:35:30, 36.27s/it] {'loss': '0.6637', 'grad_norm': '0.1802', 'learning_rate': '0.0001511', 'ppl': '1.942', 'memory/max_active (GiB)': '28.04', 'memory/max_allocated (GiB)': '28.04', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '375', 'tokens/trainable': 1464686, 'tokens/total': 3203888, 'epoch': '0.4006'}
	40%\|██████████████████████████████████████████████████████████▋ \| 105/263 [1:17:11<1:35:30, 36.27s/it] 40%\|███████████████████████████████████████████████████████████▏ \| 106/263 [1:17:42<1:30:16, 34.50s/it] {'loss': '0.7321', 'grad_norm': '0.2234', 'learning_rate': '0.00015', 'ppl': '2.079', 'memory/max_active (GiB)': '21.56', 'memory/max_allocated (GiB)': '21.56', 'memory/device_reserved (GiB)': '28.04', 'tokens/train_per_sec_per_gpu': '313.8', 'tokens/trainable': 1474221, 'tokens/total': 3224060, 'epoch': '0.4044'}
	40%\|███████████████████████████████████████████████████████████▏ \| 106/263 [1:17:42<1:30:16, 34.50s/it] 41%\|███████████████████████████████████████████████████████████▊ \| 107/263 [1:18:34<1:43:12, 39.70s/it] {'loss': '0.5681', 'grad_norm': '0.162', 'learning_rate': '0.0001488', 'ppl': '1.765', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '53.23', 'tokens/train_per_sec_per_gpu': '297.5', 'tokens/trainable': 1489638, 'tokens/total': 3260576, 'epoch': '0.4082'}
	41%\|███████████████████████████████████████████████████████████▊ \| 107/263 [1:18:34<1:43:12, 39.70s/it] 41%\|████████████████████████████████████████████████████████████▎ \| 108/263 [1:19:20<1:47:51, 41.75s/it] {'loss': '0.708', 'grad_norm': '0.197', 'learning_rate': '0.0001477', 'ppl': '2.03', 'memory/max_active (GiB)': '29.18', 'memory/max_allocated (GiB)': '29.18', 'memory/device_reserved (GiB)': '38.48', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 1502425, 'tokens/total': 3287854, 'epoch': '0.412'}
	41%\|████████████████████████████████████████████████████████████▎ \| 108/263 [1:19:20<1:47:51, 41.75s/it] 41%\|████████████████████████████████████████████████████████████▉ \| 109/263 [1:20:18<1:59:25, 46.53s/it] {'loss': '0.6689', 'grad_norm': '0.1784', 'learning_rate': '0.0001465', 'ppl': '1.952', 'memory/max_active (GiB)': '38.9', 'memory/max_allocated (GiB)': '38.9', 'memory/device_reserved (GiB)': '53.77', 'tokens/train_per_sec_per_gpu': '320.6', 'tokens/trainable': 1520916, 'tokens/total': 3326198, 'epoch': '0.4158'}
	41%\|████████████████████████████████████████████████████████████▉ \| 109/263 [1:20:18<1:59:25, 46.53s/it] 42%\|█████████████████████████████████████████████████████████████▍ \| 110/263 [1:21:01<1:56:00, 45.49s/it] {'loss': '0.6355', 'grad_norm': '0.1931', 'learning_rate': '0.0001453', 'ppl': '1.888', 'memory/max_active (GiB)': '28.51', 'memory/max_allocated (GiB)': '28.51', 'memory/device_reserved (GiB)': '43.21', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 1532644, 'tokens/total': 3353708, 'epoch': '0.4196'}
	42%\|█████████████████████████████████████████████████████████████▍ \| 110/263 [1:21:01<1:56:00, 45.49s/it] 42%\|██████████████████████████████████████████████████████████████ \| 111/263 [1:21:37<1:48:21, 42.77s/it] {'loss': '0.6837', 'grad_norm': '0.1755', 'learning_rate': '0.0001442', 'ppl': '1.981', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.12', 'tokens/train_per_sec_per_gpu': '395.7', 'tokens/trainable': 1547058, 'tokens/total': 3383196, 'epoch': '0.4235'}
	42%\|██████████████████████████████████████████████████████████████ \| 111/263 [1:21:37<1:48:21, 42.77s/it] 43%\|██████████████████████████████████████████████████████████████▌ \| 112/263 [1:22:31<1:55:30, 45.90s/it] {'loss': '0.6215', 'grad_norm': '0.2103', 'learning_rate': '0.000143', 'ppl': '1.862', 'memory/max_active (GiB)': '40.6', 'memory/max_allocated (GiB)': '40.6', 'memory/device_reserved (GiB)': '56.43', 'tokens/train_per_sec_per_gpu': '248.4', 'tokens/trainable': 1560272, 'tokens/total': 3415688, 'epoch': '0.4273'}
	43%\|██████████████████████████████████████████████████████████████▌ \| 112/263 [1:22:31<1:55:30, 45.90s/it] 43%\|███████████████████████████████████████████████████████████████▏ \| 113/263 [1:23:20<1:57:39, 47.06s/it] {'loss': '0.6699', 'grad_norm': '0.1813', 'learning_rate': '0.0001418', 'ppl': '1.954', 'memory/max_active (GiB)': '37.04', 'memory/max_allocated (GiB)': '37.04', 'memory/device_reserved (GiB)': '46.8', 'tokens/train_per_sec_per_gpu': '282.5', 'tokens/trainable': 1574335, 'tokens/total': 3444166, 'epoch': '0.4311'}
	43%\|███████████████████████████████████████████████████████████████▏ \| 113/263 [1:23:20<1:57:39, 47.06s/it] 43%\|███████████████████████████████████████████████████████████████▋ \| 114/263 [1:24:12<2:00:02, 48.34s/it] {'loss': '0.594', 'grad_norm': '0.1998', 'learning_rate': '0.0001406', 'ppl': '1.811', 'memory/max_active (GiB)': '36.6', 'memory/max_allocated (GiB)': '36.6', 'memory/device_reserved (GiB)': '46.27', 'tokens/train_per_sec_per_gpu': '276.6', 'tokens/trainable': 1588531, 'tokens/total': 3472530, 'epoch': '0.4349'}
	43%\|███████████████████████████████████████████████████████████████▋ \| 114/263 [1:24:12<2:00:02, 48.34s/it] 44%\|████████████████████████████████████████████████████████████████▎ \| 115/263 [1:25:04<2:01:51, 49.40s/it] {'loss': '0.7205', 'grad_norm': '0.1913', 'learning_rate': '0.0001393', 'ppl': '2.055', 'memory/max_active (GiB)': '37.17', 'memory/max_allocated (GiB)': '37.17', 'memory/device_reserved (GiB)': '47.05', 'tokens/train_per_sec_per_gpu': '279.3', 'tokens/trainable': 1603020, 'tokens/total': 3503644, 'epoch': '0.4387'}
	44%\|████████████████████████████████████████████████████████████████▎ \| 115/263 [1:25:04<2:01:51, 49.40s/it] 44%\|████████████████████████████████████████████████████████████████▊ \| 116/263 [1:25:48<1:57:32, 47.97s/it] {'loss': '0.7426', 'grad_norm': '0.2782', 'learning_rate': '0.0001381', 'ppl': '2.101', 'memory/max_active (GiB)': '40', 'memory/max_allocated (GiB)': '40', 'memory/device_reserved (GiB)': '55.41', 'tokens/train_per_sec_per_gpu': '231.5', 'tokens/trainable': 1613354, 'tokens/total': 3529882, 'epoch': '0.4425'}
	44%\|████████████████████████████████████████████████████████████████▊ \| 116/263 [1:25:48<1:57:32, 47.97s/it] 44%\|█████████████████████████████████████████████████████████████████▍ \| 117/263 [1:26:52<2:08:08, 52.66s/it] {'loss': '0.6484', 'grad_norm': '0.1659', 'learning_rate': '0.0001369', 'ppl': '1.913', 'memory/max_active (GiB)': '54.45', 'memory/max_allocated (GiB)': '54.45', 'memory/device_reserved (GiB)': '77.33', 'tokens/train_per_sec_per_gpu': '272', 'tokens/trainable': 1630658, 'tokens/total': 3566156, 'epoch': '0.4464'}
	44%\|█████████████████████████████████████████████████████████████████▍ \| 117/263 [1:26:52<2:08:08, 52.66s/it] 45%\|█████████████████████████████████████████████████████████████████▉ \| 118/263 [1:27:52<2:12:36, 54.87s/it] {'loss': '0.611', 'grad_norm': '0.1486', 'learning_rate': '0.0001357', 'ppl': '1.842', 'memory/max_active (GiB)': '43.75', 'memory/max_allocated (GiB)': '43.75', 'memory/device_reserved (GiB)': '61.14', 'tokens/train_per_sec_per_gpu': '329.4', 'tokens/trainable': 1650425, 'tokens/total': 3606182, 'epoch': '0.4502'}
	45%\|█████████████████████████████████████████████████████████████████▉ \| 118/263 [1:27:52<2:12:36, 54.87s/it] 45%\|██████████████████████████████████████████████████████████████████▌ \| 119/263 [1:28:50<2:13:58, 55.82s/it] {'loss': '0.6299', 'grad_norm': '0.1702', 'learning_rate': '0.0001344', 'ppl': '1.877', 'memory/max_active (GiB)': '60.94', 'memory/max_allocated (GiB)': '60.94', 'memory/device_reserved (GiB)': '69.08', 'tokens/train_per_sec_per_gpu': '289.6', 'tokens/trainable': 1667235, 'tokens/total': 3645356, 'epoch': '0.454'}
	45%\|██████████████████████████████████████████████████████████████████▌ \| 119/263 [1:28:50<2:13:58, 55.82s/it] 46%\|███████████████████████████████████████████████████████████████████ \| 120/263 [1:29:35<2:05:41, 52.74s/it] {'loss': '0.7218', 'grad_norm': '0.1997', 'learning_rate': '0.0001332', 'ppl': '2.058', 'memory/max_active (GiB)': '37.26', 'memory/max_allocated (GiB)': '37.26', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '310.7', 'tokens/trainable': 1681386, 'tokens/total': 3674268, 'epoch': '0.4578'}
	46%\|███████████████████████████████████████████████████████████████████ \| 120/263 [1:29:35<2:05:41, 52.74s/it] 46%\|███████████████████████████████████████████████████████████████████▋ \| 121/263 [1:30:33<2:08:14, 54.19s/it] {'loss': '0.6603', 'grad_norm': '0.2087', 'learning_rate': '0.0001319', 'ppl': '1.935', 'memory/max_active (GiB)': '42.12', 'memory/max_allocated (GiB)': '42.12', 'memory/device_reserved (GiB)': '58.63', 'tokens/train_per_sec_per_gpu': '268.9', 'tokens/trainable': 1696869, 'tokens/total': 3710714, 'epoch': '0.4616'}
	46%\|███████████████████████████████████████████████████████████████████▋ \| 121/263 [1:30:33<2:08:14, 54.19s/it] 46%\|████████████████████████████████████████████████████████████████████▏ \| 122/263 [1:31:17<2:00:10, 51.14s/it] {'loss': '0.7366', 'grad_norm': '0.2266', 'learning_rate': '0.0001306', 'ppl': '2.089', 'memory/max_active (GiB)': '23.04', 'memory/max_allocated (GiB)': '23.04', 'memory/device_reserved (GiB)': '45.24', 'tokens/train_per_sec_per_gpu': '206.1', 'tokens/trainable': 1705941, 'tokens/total': 3730260, 'epoch': '0.4654'}
	46%\|████████████████████████████████████████████████████████████████████▏ \| 122/263 [1:31:17<2:00:10, 51.14s/it] 47%\|████████████████████████████████████████████████████████████████████▋ \| 123/263 [1:32:05<1:57:21, 50.29s/it] {'loss': '0.7187', 'grad_norm': '0.164', 'learning_rate': '0.0001294', 'ppl': '2.052', 'memory/max_active (GiB)': '31.57', 'memory/max_allocated (GiB)': '31.57', 'memory/device_reserved (GiB)': '41.88', 'tokens/train_per_sec_per_gpu': '350.2', 'tokens/trainable': 1722869, 'tokens/total': 3764600, 'epoch': '0.4692'}
	47%\|████████████████████████████████████████████████████████████████████▋ \| 123/263 [1:32:05<1:57:21, 50.29s/it] 47%\|█████████████████████████████████████████████████████████████████████▎ \| 124/263 [1:32:54<1:55:10, 49.72s/it] {'loss': '0.7666', 'grad_norm': '0.2024', 'learning_rate': '0.0001281', 'ppl': '2.152', 'memory/max_active (GiB)': '28.74', 'memory/max_allocated (GiB)': '28.74', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '259.1', 'tokens/trainable': 1735401, 'tokens/total': 3789866, 'epoch': '0.4731'}
	47%\|█████████████████████████████████████████████████████████████████████▎ \| 124/263 [1:32:54<1:55:10, 49.72s/it] 48%\|█████████████████████████████████████████████████████████████████████▊ \| 125/263 [1:33:56<2:03:06, 53.52s/it] {'loss': '0.6747', 'grad_norm': '0.1734', 'learning_rate': '0.0001268', 'ppl': '1.963', 'memory/max_active (GiB)': '36.35', 'memory/max_allocated (GiB)': '36.35', 'memory/device_reserved (GiB)': '45.94', 'tokens/train_per_sec_per_gpu': '266.3', 'tokens/trainable': 1752018, 'tokens/total': 3826538, 'epoch': '0.4769'}
	48%\|█████████████████████████████████████████████████████████████████████▊ \| 125/263 [1:33:56<2:03:06, 53.52s/it] 48%\|██████████████████████████████████████████████████████████████████████▍ \| 126/263 [1:34:58<2:07:43, 55.94s/it] {'loss': '0.7026', 'grad_norm': '0.1709', 'learning_rate': '0.0001256', 'ppl': '2.019', 'memory/max_active (GiB)': '34.18', 'memory/max_allocated (GiB)': '34.18', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '257.9', 'tokens/trainable': 1767897, 'tokens/total': 3861520, 'epoch': '0.4807'}
	48%\|██████████████████████████████████████████████████████████████████████▍ \| 126/263 [1:34:58<2:07:43, 55.94s/it] 48%\|██████████████████████████████████████████████████████████████████████▉ \| 127/263 [1:35:43<1:59:53, 52.89s/it] {'loss': '0.7321', 'grad_norm': '0.2378', 'learning_rate': '0.0001243', 'ppl': '2.079', 'memory/max_active (GiB)': '26.64', 'memory/max_allocated (GiB)': '26.64', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '211', 'tokens/trainable': 1777560, 'tokens/total': 3883630, 'epoch': '0.4845'}
	48%\|██████████████████████████████████████████████████████████████████████▉ \| 127/263 [1:35:43<1:59:53, 52.89s/it] 49%\|███████████████████████████████████████████████████████████████████████▌ \| 128/263 [1:36:21<1:48:48, 48.36s/it] {'loss': '0.6494', 'grad_norm': '0.1878', 'learning_rate': '0.000123', 'ppl': '1.914', 'memory/max_active (GiB)': '23.99', 'memory/max_allocated (GiB)': '23.99', 'memory/device_reserved (GiB)': '31.15', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 1789026, 'tokens/total': 3907826, 'epoch': '0.4883'}
	49%\|███████████████████████████████████████████████████████████████████████▌ \| 128/263 [1:36:21<1:48:48, 48.36s/it] 49%\|████████████████████████████████████████████████████████████████████████ \| 129/263 [1:37:18<1:53:46, 50.95s/it] {'loss': '0.6286', 'grad_norm': '0.1865', 'learning_rate': '0.0001217', 'ppl': '1.875', 'memory/max_active (GiB)': '34.03', 'memory/max_allocated (GiB)': '34.03', 'memory/device_reserved (GiB)': '42.72', 'tokens/train_per_sec_per_gpu': '281.1', 'tokens/trainable': 1805039, 'tokens/total': 3942120, 'epoch': '0.4921'}
	49%\|████████████████████████████████████████████████████████████████████████ \| 129/263 [1:37:18<1:53:46, 50.95s/it] 49%\|████████████████████████████████████████████████████████████████████████▋ \| 130/263 [1:38:19<1:59:46, 54.03s/it] {'loss': '0.5905', 'grad_norm': '0.1831', 'learning_rate': '0.0001204', 'ppl': '1.805', 'memory/max_active (GiB)': '40.32', 'memory/max_allocated (GiB)': '40.32', 'memory/device_reserved (GiB)': '55.9', 'tokens/train_per_sec_per_gpu': '282.4', 'tokens/trainable': 1822330, 'tokens/total': 3976238, 'epoch': '0.4959'}
	49%\|████████████████████████████████████████████████████████████████████████▋ \| 130/263 [1:38:19<1:59:46, 54.03s/it] 50%\|█████████████████████████████████████████████████████████████████████████▏ \| 131/263 [1:39:03<1:52:00, 50.91s/it] {'loss': '0.6443', 'grad_norm': '0.1812', 'learning_rate': '0.0001191', 'ppl': '1.905', 'memory/max_active (GiB)': '33.09', 'memory/max_allocated (GiB)': '33.09', 'memory/device_reserved (GiB)': '41.43', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 1835502, 'tokens/total': 4004024, 'epoch': '0.4998'}
	50%\|█████████████████████████████████████████████████████████████████████████▏ \| 131/263 [1:39:03<1:52:00, 50.91s/it] 50%\|█████████████████████████████████████████████████████████████████████████▊ \| 132/263 [1:40:07<1:59:46, 54.86s/it] {'loss': '0.6177', 'grad_norm': '0.158', 'learning_rate': '0.0001178', 'ppl': '1.855', 'memory/max_active (GiB)': '44.4', 'memory/max_allocated (GiB)': '44.4', 'memory/device_reserved (GiB)': '62.21', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 1854705, 'tokens/total': 4047366, 'epoch': '0.5036'}
	50%\|█████████████████████████████████████████████████████████████████████████▊ \| 132/263 [1:40:07<1:59:46, 54.86s/it] 51%\|██████████████████████████████████████████████████████████████████████████▎ \| 133/263 [1:41:01<1:58:27, 54.67s/it] {'loss': '0.6597', 'grad_norm': '0.197', 'learning_rate': '0.0001165', 'ppl': '1.934', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '225.4', 'tokens/trainable': 1866931, 'tokens/total': 4073028, 'epoch': '0.5074'}
	51%\|██████████████████████████████████████████████████████████████████████████▎ \| 133/263 [1:41:01<1:58:27, 54.67s/it] 51%\|██████████████████████████████████████████████████████████████████████████▉ \| 134/263 [1:42:00<2:00:25, 56.01s/it] {'loss': '0.6264', 'grad_norm': '0.1803', 'learning_rate': '0.0001152', 'ppl': '1.871', 'memory/max_active (GiB)': '35.15', 'memory/max_allocated (GiB)': '35.15', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '217.6', 'tokens/trainable': 1879795, 'tokens/total': 4103326, 'epoch': '0.5112'}
	51%\|██████████████████████████████████████████████████████████████████████████▉ \| 134/263 [1:42:01<2:00:25, 56.01s/it] 51%\|███████████████████████████████████████████████████████████████████████████▍ \| 135/263 [1:43:08<2:06:45, 59.42s/it] {'loss': '0.5894', 'grad_norm': '0.1796', 'learning_rate': '0.0001139', 'ppl': '1.803', 'memory/max_active (GiB)': '60.2', 'memory/max_allocated (GiB)': '60.2', 'memory/device_reserved (GiB)': '78.05', 'tokens/train_per_sec_per_gpu': '280.6', 'tokens/trainable': 1898706, 'tokens/total': 4145556, 'epoch': '0.515'}
	51%\|███████████████████████████████████████████████████████████████████████████▍ \| 135/263 [1:43:08<2:06:45, 59.42s/it] 52%\|████████████████████████████████████████████████████████████████████████████ \| 136/263 [1:44:02<2:02:32, 57.90s/it] {'loss': '0.6652', 'grad_norm': '0.1649', 'learning_rate': '0.0001126', 'ppl': '1.945', 'memory/max_active (GiB)': '36.24', 'memory/max_allocated (GiB)': '36.24', 'memory/device_reserved (GiB)': '45.69', 'tokens/train_per_sec_per_gpu': '253.7', 'tokens/trainable': 1912491, 'tokens/total': 4178912, 'epoch': '0.5188'}
	52%\|████████████████████████████████████████████████████████████████████████████ \| 136/263 [1:44:02<2:02:32, 57.90s/it] 52%\|████████████████████████████████████████████████████████████████████████████▌ \| 137/263 [1:45:08<2:06:33, 60.27s/it] {'loss': '0.6049', 'grad_norm': '0.1469', 'learning_rate': '0.0001112', 'ppl': '1.831', 'memory/max_active (GiB)': '54.11', 'memory/max_allocated (GiB)': '54.11', 'memory/device_reserved (GiB)': '77.07', 'tokens/train_per_sec_per_gpu': '291.8', 'tokens/trainable': 1931691, 'tokens/total': 4219728, 'epoch': '0.5227'}
	52%\|████████████████████████████████████████████████████████████████████████████▌ \| 137/263 [1:45:08<2:06:33, 60.27s/it] 52%\|█████████████████████████████████████████████████████████████████████████████▏ \| 138/263 [1:45:55<1:57:30, 56.40s/it] {'loss': '0.7114', 'grad_norm': '0.219', 'learning_rate': '0.0001099', 'ppl': '2.037', 'memory/max_active (GiB)': '29.78', 'memory/max_allocated (GiB)': '29.78', 'memory/device_reserved (GiB)': '39.28', 'tokens/train_per_sec_per_gpu': '230.8', 'tokens/trainable': 1942627, 'tokens/total': 4246086, 'epoch': '0.5265'}
	52%\|█████████████████████████████████████████████████████████████████████████████▏ \| 138/263 [1:45:55<1:57:30, 56.40s/it] 53%\|█████████████████████████████████████████████████████████████████████████████▋ \| 139/263 [1:46:33<1:44:54, 50.76s/it] {'loss': '0.6658', 'grad_norm': '0.2107', 'learning_rate': '0.0001086', 'ppl': '1.946', 'memory/max_active (GiB)': '32.15', 'memory/max_allocated (GiB)': '32.15', 'memory/device_reserved (GiB)': '35.63', 'tokens/train_per_sec_per_gpu': '371.6', 'tokens/trainable': 1956598, 'tokens/total': 4274466, 'epoch': '0.5303'}
	53%\|█████████████████████████████████████████████████████████████████████████████▋ \| 139/263 [1:46:33<1:44:54, 50.76s/it] 53%\|██████████████████████████████████████████████████████████████████████████████▎ \| 140/263 [1:47:10<1:35:26, 46.56s/it] {'loss': '0.6586', 'grad_norm': '0.1947', 'learning_rate': '0.0001073', 'ppl': '1.932', 'memory/max_active (GiB)': '29.82', 'memory/max_allocated (GiB)': '29.82', 'memory/device_reserved (GiB)': '39.44', 'tokens/train_per_sec_per_gpu': '341.7', 'tokens/trainable': 1969155, 'tokens/total': 4302324, 'epoch': '0.5341'}
	53%\|██████████████████████████████████████████████████████████████████████████████▎ \| 140/263 [1:47:10<1:35:26, 46.56s/it] 54%\|██████████████████████████████████████████████████████████████████████████████▊ \| 141/263 [1:47:52<1:32:13, 45.36s/it] {'loss': '0.7005', 'grad_norm': '0.2028', 'learning_rate': '0.000106', 'ppl': '2.015', 'memory/max_active (GiB)': '30.7', 'memory/max_allocated (GiB)': '30.7', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '272.9', 'tokens/trainable': 1980769, 'tokens/total': 4330164, 'epoch': '0.5379'}
	54%\|██████████████████████████████████████████████████████████████████████████████▊ \| 141/263 [1:47:52<1:32:13, 45.36s/it] 54%\|███████████████████████████████████████████████████████████████████████████████▎ \| 142/263 [1:48:40<1:33:08, 46.19s/it] {'loss': '0.6936', 'grad_norm': '0.207', 'learning_rate': '0.0001046', 'ppl': '2.001', 'memory/max_active (GiB)': '30.38', 'memory/max_allocated (GiB)': '30.38', 'memory/device_reserved (GiB)': '40.1', 'tokens/train_per_sec_per_gpu': '213.5', 'tokens/trainable': 1991044, 'tokens/total': 4352668, 'epoch': '0.5417'}
	54%\|███████████████████████████████████████████████████████████████████████████████▎ \| 142/263 [1:48:40<1:33:08, 46.19s/it] 54%\|███████████████████████████████████████████████████████████████████████████████▉ \| 143/263 [1:49:28<1:32:57, 46.48s/it] {'loss': '0.7092', 'grad_norm': '0.1816', 'learning_rate': '0.0001033', 'ppl': '2.032', 'memory/max_active (GiB)': '33.08', 'memory/max_allocated (GiB)': '33.08', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '325.4', 'tokens/trainable': 2006393, 'tokens/total': 4386038, 'epoch': '0.5455'}
	54%\|███████████████████████████████████████████████████████████████████████████████▉ \| 143/263 [1:49:28<1:32:57, 46.48s/it] 55%\|████████████████████████████████████████████████████████████████████████████████▍ \| 144/263 [1:49:54<1:20:19, 40.50s/it] {'loss': '0.6535', 'grad_norm': '0.2587', 'learning_rate': '0.000102', 'ppl': '1.922', 'memory/max_active (GiB)': '24', 'memory/max_allocated (GiB)': '24', 'memory/device_reserved (GiB)': '30.59', 'tokens/train_per_sec_per_gpu': '326.8', 'tokens/trainable': 2015072, 'tokens/total': 4404946, 'epoch': '0.5494'}
	55%\|████████████████████████████████████████████████████████████████████████████████▍ \| 144/263 [1:49:54<1:20:19, 40.50s/it] 55%\|█████████████████████████████████████████████████████████████████████████████████ \| 145/263 [1:50:25<1:13:46, 37.51s/it] {'loss': '0.6609', 'grad_norm': '0.2187', 'learning_rate': '0.0001007', 'ppl': '1.937', 'memory/max_active (GiB)': '24.8', 'memory/max_allocated (GiB)': '24.8', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '344.2', 'tokens/trainable': 2025583, 'tokens/total': 4426472, 'epoch': '0.5532'}
	55%\|█████████████████████████████████████████████████████████████████████████████████ \| 145/263 [1:50:25<1:13:46, 37.51s/it] 56%\|█████████████████████████████████████████████████████████████████████████████████▌ \| 146/263 [1:51:10<1:17:32, 39.76s/it] {'loss': '0.6478', 'grad_norm': '0.2216', 'learning_rate': '9.934e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '28.07', 'memory/max_allocated (GiB)': '28.07', 'memory/device_reserved (GiB)': '36.78', 'tokens/train_per_sec_per_gpu': '212.7', 'tokens/trainable': 2035157, 'tokens/total': 4447532, 'epoch': '0.557'}
	56%\|█████████████████████████████████████████████████████████████████████████████████▌ \| 146/263 [1:51:10<1:17:32, 39.76s/it] 56%\|██████████████████████████████████████████████████████████████████████████████████▏ \| 147/263 [1:51:59<1:22:08, 42.49s/it] {'loss': '0.6784', 'grad_norm': '0.2246', 'learning_rate': '9.801e-05', 'ppl': '1.971', 'memory/max_active (GiB)': '24.88', 'memory/max_allocated (GiB)': '24.88', 'memory/device_reserved (GiB)': '31.92', 'tokens/train_per_sec_per_gpu': '225.8', 'tokens/trainable': 2046191, 'tokens/total': 4469110, 'epoch': '0.5608'}
	56%\|██████████████████████████████████████████████████████████████████████████████████▏ \| 147/263 [1:51:59<1:22:08, 42.49s/it] 56%\|██████████████████████████████████████████████████████████████████████████████████▋ \| 148/263 [1:52:52<1:27:37, 45.72s/it] {'loss': '0.6373', 'grad_norm': '0.1724', 'learning_rate': '9.669e-05', 'ppl': '1.891', 'memory/max_active (GiB)': '44.04', 'memory/max_allocated (GiB)': '44.04', 'memory/device_reserved (GiB)': '61.51', 'tokens/train_per_sec_per_gpu': '327.3', 'tokens/trainable': 2063617, 'tokens/total': 4505868, 'epoch': '0.5646'}
	56%\|██████████████████████████████████████████████████████████████████████████████████▋ \| 148/263 [1:52:52<1:27:37, 45.72s/it] 57%\|███████████████████████████████████████████████████████████████████████████████████▎ \| 149/263 [1:53:39<1:27:45, 46.19s/it] {'loss': '0.5898', 'grad_norm': '0.1733', 'learning_rate': '9.536e-05', 'ppl': '1.804', 'memory/max_active (GiB)': '31.09', 'memory/max_allocated (GiB)': '31.09', 'memory/device_reserved (GiB)': '37.29', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 2078116, 'tokens/total': 4537032, 'epoch': '0.5684'}
	57%\|███████████████████████████████████████████████████████████████████████████████████▎ \| 149/263 [1:53:39<1:27:45, 46.19s/it] 57%\|███████████████████████████████████████████████████████████████████████████████████▊ \| 150/263 [1:54:25<1:26:36, 45.98s/it] {'loss': '0.5857', 'grad_norm': '0.199', 'learning_rate': '9.404e-05', 'ppl': '1.796', 'memory/max_active (GiB)': '36.3', 'memory/max_allocated (GiB)': '36.3', 'memory/device_reserved (GiB)': '45.92', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 2089719, 'tokens/total': 4564268, 'epoch': '0.5722'}
	57%\|███████████████████████████████████████████████████████████████████████████████████▊ \| 150/263 [1:54:25<1:26:36, 45.98s/it] 57%\|████████████████████████████████████████████████████████████████████████████████████▍ \| 151/263 [1:55:16<1:29:04, 47.72s/it] {'loss': '0.6078', 'grad_norm': '0.2164', 'learning_rate': '9.272e-05', 'ppl': '1.836', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '259.3', 'tokens/trainable': 2103145, 'tokens/total': 4593966, 'epoch': '0.5761'}
	57%\|████████████████████████████████████████████████████████████████████████████████████▍ \| 151/263 [1:55:16<1:29:04, 47.72s/it] 58%\|████████████████████████████████████████████████████████████████████████████████████▉ \| 152/263 [1:56:03<1:27:31, 47.31s/it] {'loss': '0.7859', 'grad_norm': '0.2286', 'learning_rate': '9.139e-05', 'ppl': '2.194', 'memory/max_active (GiB)': '25.58', 'memory/max_allocated (GiB)': '25.58', 'memory/device_reserved (GiB)': '33.29', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 2113624, 'tokens/total': 4616602, 'epoch': '0.5799'}
	58%\|████████████████████████████████████████████████████████████████████████████████████▉ \| 152/263 [1:56:03<1:27:31, 47.31s/it] 58%\|█████████████████████████████████████████████████████████████████████████████████████▌ \| 153/263 [1:56:39<1:20:33, 43.94s/it] {'loss': '0.6501', 'grad_norm': '0.2131', 'learning_rate': '9.007e-05', 'ppl': '1.916', 'memory/max_active (GiB)': '23.47', 'memory/max_allocated (GiB)': '23.47', 'memory/device_reserved (GiB)': '29.78', 'tokens/train_per_sec_per_gpu': '255.8', 'tokens/trainable': 2122849, 'tokens/total': 4636788, 'epoch': '0.5837'}
	58%\|█████████████████████████████████████████████████████████████████████████████████████▌ \| 153/263 [1:56:39<1:20:33, 43.94s/it] 59%\|██████████████████████████████████████████████████████████████████████████████████████ \| 154/263 [1:57:30<1:23:40, 46.06s/it] {'loss': '0.7322', 'grad_norm': '0.2115', 'learning_rate': '8.876e-05', 'ppl': '2.08', 'memory/max_active (GiB)': '38.63', 'memory/max_allocated (GiB)': '38.63', 'memory/device_reserved (GiB)': '53.38', 'tokens/train_per_sec_per_gpu': '324', 'tokens/trainable': 2139377, 'tokens/total': 4672022, 'epoch': '0.5875'}
	59%\|██████████████████████████████████████████████████████████████████████████████████████ \| 154/263 [1:57:30<1:23:40, 46.06s/it] 59%\|██████████████████████████████████████████████████████████████████████████████████████▋ \| 155/263 [1:58:23<1:26:31, 48.07s/it] {'loss': '0.6393', 'grad_norm': '0.1979', 'learning_rate': '8.744e-05', 'ppl': '1.895', 'memory/max_active (GiB)': '39.73', 'memory/max_allocated (GiB)': '39.73', 'memory/device_reserved (GiB)': '54.95', 'tokens/train_per_sec_per_gpu': '278.7', 'tokens/trainable': 2154077, 'tokens/total': 4701432, 'epoch': '0.5913'}
	59%\|██████████████████████████████████████████████████████████████████████████████████████▋ \| 155/263 [1:58:23<1:26:31, 48.07s/it] 59%\|███████████████████████████████████████████████████████████████████████████████████████▏ \| 156/263 [1:59:20<1:30:31, 50.76s/it] {'loss': '0.6361', 'grad_norm': '0.1782', 'learning_rate': '8.613e-05', 'ppl': '1.889', 'memory/max_active (GiB)': '42.74', 'memory/max_allocated (GiB)': '42.74', 'memory/device_reserved (GiB)': '59.9', 'tokens/train_per_sec_per_gpu': '296.8', 'tokens/trainable': 2171008, 'tokens/total': 4736142, 'epoch': '0.5951'}
	59%\|███████████████████████████████████████████████████████████████████████████████████████▏ \| 156/263 [1:59:20<1:30:31, 50.76s/it] 60%\|███████████████████████████████████████████████████████████████████████████████████████▊ \| 157/263 [2:00:09<1:28:43, 50.22s/it] {'loss': '0.7478', 'grad_norm': '0.185', 'learning_rate': '8.481e-05', 'ppl': '2.112', 'memory/max_active (GiB)': '39.56', 'memory/max_allocated (GiB)': '39.56', 'memory/device_reserved (GiB)': '54.81', 'tokens/train_per_sec_per_gpu': '291.1', 'tokens/trainable': 2185260, 'tokens/total': 4769674, 'epoch': '0.599'}
	60%\|███████████████████████████████████████████████████████████████████████████████████████▊ \| 157/263 [2:00:09<1:28:43, 50.22s/it] 60%\|████████████████████████████████████████████████████████████████████████████████████████▎ \| 158/263 [2:01:16<1:36:55, 55.38s/it] {'loss': '0.6923', 'grad_norm': '0.1927', 'learning_rate': '8.351e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '51.52', 'memory/max_allocated (GiB)': '51.52', 'memory/device_reserved (GiB)': '73.17', 'tokens/train_per_sec_per_gpu': '299.4', 'tokens/trainable': 2205449, 'tokens/total': 4813544, 'epoch': '0.6028'}
	60%\|████████████████████████████████████████████████████████████████████████████████████████▎ \| 158/263 [2:01:16<1:36:55, 55.38s/it] 60%\|████████████████████████████████████████████████████████████████████████████████████████▊ \| 159/263 [2:02:19<1:39:59, 57.69s/it] {'loss': '0.6566', 'grad_norm': '0.1942', 'learning_rate': '8.22e-05', 'ppl': '1.928', 'memory/max_active (GiB)': '45.07', 'memory/max_allocated (GiB)': '45.07', 'memory/device_reserved (GiB)': '63.03', 'tokens/train_per_sec_per_gpu': '265.1', 'tokens/trainable': 2222171, 'tokens/total': 4848392, 'epoch': '0.6066'}
	60%\|████████████████████████████████████████████████████████████████████████████████████████▊ \| 159/263 [2:02:19<1:39:59, 57.69s/it] 61%\|█████████████████████████████████████████████████████████████████████████████████████████▍ \| 160/263 [2:03:11<1:36:19, 56.11s/it] {'loss': '0.6622', 'grad_norm': '0.1977', 'learning_rate': '8.09e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '46.42', 'memory/max_allocated (GiB)': '46.42', 'memory/device_reserved (GiB)': '65.06', 'tokens/train_per_sec_per_gpu': '337.5', 'tokens/trainable': 2239867, 'tokens/total': 4886222, 'epoch': '0.6104'}
	61%\|█████████████████████████████████████████████████████████████████████████████████████████▍ \| 160/263 [2:03:11<1:36:19, 56.11s/it] 61%\|█████████████████████████████████████████████████████████████████████████████████████████▉ \| 161/263 [2:03:47<1:24:41, 49.82s/it] {'loss': '0.6882', 'grad_norm': '0.1678', 'learning_rate': '7.96e-05', 'ppl': '1.99', 'memory/max_active (GiB)': '25.15', 'memory/max_allocated (GiB)': '25.15', 'memory/device_reserved (GiB)': '32.47', 'tokens/train_per_sec_per_gpu': '420.2', 'tokens/trainable': 2254631, 'tokens/total': 4914572, 'epoch': '0.6142'}
	61%\|█████████████████████████████████████████████████████████████████████████████████████████▉ \| 161/263 [2:03:47<1:24:41, 49.82s/it] 62%\|██████████████████████████████████████████████████████████████████████████████████████████▌ \| 162/263 [2:04:31<1:21:04, 48.17s/it] {'loss': '0.6634', 'grad_norm': '0.254', 'learning_rate': '7.83e-05', 'ppl': '1.941', 'memory/max_active (GiB)': '25.88', 'memory/max_allocated (GiB)': '25.88', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '195.9', 'tokens/trainable': 2263311, 'tokens/total': 4934728, 'epoch': '0.618'}
	62%\|██████████████████████████████████████████████████████████████████████████████████████████▌ \| 162/263 [2:04:31<1:21:04, 48.17s/it] 62%\|███████████████████████████████████████████████████████████████████████████████████████████ \| 163/263 [2:05:28<1:24:55, 50.96s/it] {'loss': '0.6232', 'grad_norm': '0.1764', 'learning_rate': '7.701e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '33.21', 'memory/max_allocated (GiB)': '33.21', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '270', 'tokens/trainable': 2278829, 'tokens/total': 4967958, 'epoch': '0.6218'}
	62%\|███████████████████████████████████████████████████████████████████████████████████████████ \| 163/263 [2:05:28<1:24:55, 50.96s/it] 62%\|███████████████████████████████████████████████████████████████████████████████████████████▋ \| 164/263 [2:06:01<1:14:46, 45.32s/it] {'loss': '0.7008', 'grad_norm': '0.1909', 'learning_rate': '7.572e-05', 'ppl': '2.015', 'memory/max_active (GiB)': '27.44', 'memory/max_allocated (GiB)': '27.44', 'memory/device_reserved (GiB)': '35.74', 'tokens/train_per_sec_per_gpu': '401.3', 'tokens/trainable': 2291731, 'tokens/total': 4994350, 'epoch': '0.6257'}
	62%\|███████████████████████████████████████████████████████████████████████████████████████████▋ \| 164/263 [2:06:01<1:14:46, 45.32s/it] 63%\|████████████████████████████████████████████████████████████████████████████████████████████▏ \| 165/263 [2:06:33<1:07:45, 41.49s/it] {'loss': '0.6839', 'grad_norm': '0.213', 'learning_rate': '7.444e-05', 'ppl': '1.982', 'memory/max_active (GiB)': '31.91', 'memory/max_allocated (GiB)': '31.91', 'memory/device_reserved (GiB)': '42.43', 'tokens/train_per_sec_per_gpu': '404.2', 'tokens/trainable': 2304893, 'tokens/total': 5018514, 'epoch': '0.6295'}
	63%\|████████████████████████████████████████████████████████████████████████████████████████████▏ \| 165/263 [2:06:33<1:07:45, 41.49s/it] 63%\|████████████████████████████████████████████████████████████████████████████████████████████▊ \| 166/263 [2:07:29<1:13:54, 45.72s/it] {'loss': '0.6054', 'grad_norm': '0.1739', 'learning_rate': '7.316e-05', 'ppl': '1.832', 'memory/max_active (GiB)': '35.17', 'memory/max_allocated (GiB)': '35.17', 'memory/device_reserved (GiB)': '44.2', 'tokens/train_per_sec_per_gpu': '285.8', 'tokens/trainable': 2320783, 'tokens/total': 5050822, 'epoch': '0.6333'}
	63%\|████████████████████████████████████████████████████████████████████████████████████████████▊ \| 166/263 [2:07:29<1:13:54, 45.72s/it] 63%\|█████████████████████████████████████████████████████████████████████████████████████████████▎ \| 167/263 [2:08:27<1:19:11, 49.50s/it] {'loss': '0.6887', 'grad_norm': '0.1949', 'learning_rate': '7.188e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '40.38', 'memory/max_allocated (GiB)': '40.38', 'memory/device_reserved (GiB)': '56.06', 'tokens/train_per_sec_per_gpu': '304.9', 'tokens/trainable': 2338559, 'tokens/total': 5082648, 'epoch': '0.6371'}
	63%\|█████████████████████████████████████████████████████████████████████████████████████████████▎ \| 167/263 [2:08:27<1:19:11, 49.50s/it] 64%\|█████████████████████████████████████████████████████████████████████████████████████████████▉ \| 168/263 [2:09:09<1:14:44, 47.20s/it] {'loss': '0.6576', 'grad_norm': '0.246', 'learning_rate': '7.061e-05', 'ppl': '1.93', 'memory/max_active (GiB)': '22.77', 'memory/max_allocated (GiB)': '22.77', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '217.1', 'tokens/trainable': 2347641, 'tokens/total': 5103024, 'epoch': '0.6409'}
	64%\|█████████████████████████████████████████████████████████████████████████████████████████████▉ \| 168/263 [2:09:09<1:14:44, 47.20s/it] 64%\|██████████████████████████████████████████████████████████████████████████████████████████████▍ \| 169/263 [2:09:58<1:14:40, 47.66s/it] {'loss': '0.5939', 'grad_norm': '0.1881', 'learning_rate': '6.935e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.16', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 2362246, 'tokens/total': 5135012, 'epoch': '0.6447'}
	64%\|██████████████████████████████████████████████████████████████████████████████████████████████▍ \| 169/263 [2:09:58<1:14:40, 47.66s/it] 65%\|███████████████████████████████████████████████████████████████████████████████████████████████ \| 170/263 [2:10:42<1:12:29, 46.77s/it] {'loss': '0.6675', 'grad_norm': '0.1941', 'learning_rate': '6.809e-05', 'ppl': '1.949', 'memory/max_active (GiB)': '40.07', 'memory/max_allocated (GiB)': '40.07', 'memory/device_reserved (GiB)': '55.49', 'tokens/train_per_sec_per_gpu': '330.2', 'tokens/trainable': 2376998, 'tokens/total': 5165914, 'epoch': '0.6485'}
	65%\|███████████████████████████████████████████████████████████████████████████████████████████████ \| 170/263 [2:10:42<1:12:29, 46.77s/it] 65%\|███████████████████████████████████████████████████████████████████████████████████████████████▌ \| 171/263 [2:11:22<1:08:18, 44.54s/it] {'loss': '0.662', 'grad_norm': '0.2005', 'learning_rate': '6.684e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '26.99', 'memory/max_allocated (GiB)': '26.99', 'memory/device_reserved (GiB)': '35.18', 'tokens/train_per_sec_per_gpu': '334.3', 'tokens/trainable': 2390156, 'tokens/total': 5192072, 'epoch': '0.6524'}
	65%\|███████████████████████████████████████████████████████████████████████████████████████████████▌ \| 171/263 [2:11:22<1:08:18, 44.54s/it] 65%\|████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 172/263 [2:12:13<1:10:49, 46.70s/it] {'loss': '0.7026', 'grad_norm': '0.2259', 'learning_rate': '6.559e-05', 'ppl': '2.019', 'memory/max_active (GiB)': '47.05', 'memory/max_allocated (GiB)': '47.05', 'memory/device_reserved (GiB)': '66.2', 'tokens/train_per_sec_per_gpu': '264.1', 'tokens/trainable': 2403814, 'tokens/total': 5223520, 'epoch': '0.6562'}
	65%\|████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 172/263 [2:12:13<1:10:49, 46.70s/it] 66%\|████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 173/263 [2:13:04<1:11:49, 47.89s/it] {'loss': '0.6707', 'grad_norm': '0.197', 'learning_rate': '6.435e-05', 'ppl': '1.956', 'memory/max_active (GiB)': '23.56', 'memory/max_allocated (GiB)': '23.56', 'memory/device_reserved (GiB)': '30.05', 'tokens/train_per_sec_per_gpu': '217.8', 'tokens/trainable': 2414847, 'tokens/total': 5247654, 'epoch': '0.66'}
	66%\|████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 173/263 [2:13:04<1:11:49, 47.89s/it] 66%\|█████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 174/263 [2:13:57<1:13:05, 49.28s/it] {'loss': '0.5625', 'grad_norm': '0.1798', 'learning_rate': '6.311e-05', 'ppl': '1.755', 'memory/max_active (GiB)': '30.71', 'memory/max_allocated (GiB)': '30.71', 'memory/device_reserved (GiB)': '40.59', 'tokens/train_per_sec_per_gpu': '295.5', 'tokens/trainable': 2430368, 'tokens/total': 5276410, 'epoch': '0.6638'}
	66%\|█████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 174/263 [2:13:57<1:13:05, 49.28s/it] 67%\|█████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 175/263 [2:14:44<1:11:22, 48.66s/it] {'loss': '0.6375', 'grad_norm': '0.1961', 'learning_rate': '6.188e-05', 'ppl': '1.892', 'memory/max_active (GiB)': '28.14', 'memory/max_allocated (GiB)': '28.14', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '234.5', 'tokens/trainable': 2441441, 'tokens/total': 5302790, 'epoch': '0.6676'}
	67%\|█████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 175/263 [2:14:44<1:11:22, 48.66s/it] 67%\|██████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 176/263 [2:15:37<1:12:44, 50.17s/it] {'loss': '0.6539', 'grad_norm': '0.1857', 'learning_rate': '6.066e-05', 'ppl': '1.923', 'memory/max_active (GiB)': '44.34', 'memory/max_allocated (GiB)': '44.34', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '260.5', 'tokens/trainable': 2455423, 'tokens/total': 5333674, 'epoch': '0.6714'}
	67%\|██████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 176/263 [2:15:37<1:12:44, 50.17s/it] 67%\|██████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 177/263 [2:16:24<1:10:15, 49.01s/it] {'loss': '0.5762', 'grad_norm': '0.1929', 'learning_rate': '5.945e-05', 'ppl': '1.779', 'memory/max_active (GiB)': '28.62', 'memory/max_allocated (GiB)': '28.62', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '251.4', 'tokens/trainable': 2467068, 'tokens/total': 5359564, 'epoch': '0.6753'}
	67%\|██████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 177/263 [2:16:24<1:10:15, 49.01s/it] 68%\|███████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 178/263 [2:17:02<1:05:00, 45.89s/it] {'loss': '0.6303', 'grad_norm': '0.1936', 'learning_rate': '5.824e-05', 'ppl': '1.878', 'memory/max_active (GiB)': '33.46', 'memory/max_allocated (GiB)': '33.46', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '384.4', 'tokens/trainable': 2481907, 'tokens/total': 5388678, 'epoch': '0.6791'}
	68%\|███████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 178/263 [2:17:02<1:05:00, 45.89s/it] 68%\|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 179/263 [2:17:37<59:27, 42.47s/it] {'loss': '0.6012', 'grad_norm': '0.203', 'learning_rate': '5.704e-05', 'ppl': '1.824', 'memory/max_active (GiB)': '25.76', 'memory/max_allocated (GiB)': '25.76', 'memory/device_reserved (GiB)': '33.21', 'tokens/train_per_sec_per_gpu': '338.2', 'tokens/trainable': 2493569, 'tokens/total': 5414394, 'epoch': '0.6829'}
	68%\|█████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 179/263 [2:17:37<59:27, 42.47s/it] 68%\|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 180/263 [2:18:10<54:55, 39.70s/it] {'loss': '0.6847', 'grad_norm': '0.1948', 'learning_rate': '5.585e-05', 'ppl': '1.983', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '404.4', 'tokens/trainable': 2507016, 'tokens/total': 5440266, 'epoch': '0.6867'}
	68%\|█████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 180/263 [2:18:10<54:55, 39.70s/it] 69%\|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 181/263 [2:18:44<51:47, 37.90s/it] {'loss': '0.6923', 'grad_norm': '0.2085', 'learning_rate': '5.466e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '35.58', 'memory/max_allocated (GiB)': '35.58', 'memory/device_reserved (GiB)': '44.79', 'tokens/train_per_sec_per_gpu': '367.1', 'tokens/trainable': 2519381, 'tokens/total': 5467466, 'epoch': '0.6905'}
	69%\|██████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 181/263 [2:18:44<51:47, 37.90s/it] 69%\|███████████████████████████████████████████████████████████████████████████████████████████████████████ \| 182/263 [2:19:19<50:05, 37.10s/it] {'loss': '0.7096', 'grad_norm': '0.196', 'learning_rate': '5.348e-05', 'ppl': '2.033', 'memory/max_active (GiB)': '30.89', 'memory/max_allocated (GiB)': '30.89', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '379.8', 'tokens/trainable': 2532772, 'tokens/total': 5497208, 'epoch': '0.6943'}
	69%\|███████████████████████████████████████████████████████████████████████████████████████████████████████ \| 182/263 [2:19:19<50:05, 37.10s/it] 70%\|███████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 183/263 [2:19:53<48:13, 36.17s/it] {'loss': '0.676', 'grad_norm': '0.2051', 'learning_rate': '5.231e-05', 'ppl': '1.966', 'memory/max_active (GiB)': '24.11', 'memory/max_allocated (GiB)': '24.11', 'memory/device_reserved (GiB)': '38.66', 'tokens/train_per_sec_per_gpu': '345.2', 'tokens/trainable': 2544502, 'tokens/total': 5521580, 'epoch': '0.6981'}
	70%\|███████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 183/263 [2:19:53<48:13, 36.17s/it] 70%\|████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 184/263 [2:20:43<52:59, 40.25s/it] {'loss': '0.6542', 'grad_norm': '0.2071', 'learning_rate': '5.115e-05', 'ppl': '1.924', 'memory/max_active (GiB)': '39.48', 'memory/max_allocated (GiB)': '39.48', 'memory/device_reserved (GiB)': '54.69', 'tokens/train_per_sec_per_gpu': '310.1', 'tokens/trainable': 2559938, 'tokens/total': 5553304, 'epoch': '0.702'}
	70%\|████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 184/263 [2:20:43<52:59, 40.25s/it] 70%\|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 185/263 [2:21:23<52:24, 40.31s/it] {'loss': '0.6775', 'grad_norm': '0.2571', 'learning_rate': '5e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '38.61', 'memory/max_allocated (GiB)': '38.61', 'memory/device_reserved (GiB)': '53.19', 'tokens/train_per_sec_per_gpu': '308.9', 'tokens/trainable': 2572432, 'tokens/total': 5582308, 'epoch': '0.7058'}
	70%\|████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 185/263 [2:21:23<52:24, 40.31s/it] 71%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 186/263 [2:22:13<55:23, 43.16s/it] {'loss': '0.6412', 'grad_norm': '0.2141', 'learning_rate': '4.886e-05', 'ppl': '1.899', 'memory/max_active (GiB)': '50.63', 'memory/max_allocated (GiB)': '50.63', 'memory/device_reserved (GiB)': '71.62', 'tokens/train_per_sec_per_gpu': '345.7', 'tokens/trainable': 2589650, 'tokens/total': 5622242, 'epoch': '0.7096'}
	71%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 186/263 [2:22:13<55:23, 43.16s/it] 71%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 187/263 [2:22:59<55:44, 44.01s/it] {'loss': '0.6154', 'grad_norm': '0.1996', 'learning_rate': '4.772e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '33.16', 'memory/max_allocated (GiB)': '33.16', 'memory/device_reserved (GiB)': '70.77', 'tokens/train_per_sec_per_gpu': '289.1', 'tokens/trainable': 2602951, 'tokens/total': 5652818, 'epoch': '0.7134'}
	71%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 187/263 [2:22:59<55:44, 44.01s/it] 71%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 188/263 [2:23:52<58:33, 46.84s/it] {'loss': '0.6231', 'grad_norm': '0.2165', 'learning_rate': '4.66e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '55.1', 'memory/max_allocated (GiB)': '55.1', 'memory/device_reserved (GiB)': '78.44', 'tokens/train_per_sec_per_gpu': '241.5', 'tokens/trainable': 2615857, 'tokens/total': 5682190, 'epoch': '0.7172'}
	71%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 188/263 [2:23:52<58:33, 46.84s/it] 72%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 189/263 [2:24:51<1:02:07, 50.37s/it] {'loss': '0.644', 'grad_norm': '0.1827', 'learning_rate': '4.548e-05', 'ppl': '1.904', 'memory/max_active (GiB)': '40.42', 'memory/max_allocated (GiB)': '40.42', 'memory/device_reserved (GiB)': '56.19', 'tokens/train_per_sec_per_gpu': '308.8', 'tokens/trainable': 2633955, 'tokens/total': 5718572, 'epoch': '0.721'}
	72%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 189/263 [2:24:51<1:02:07, 50.37s/it] 72%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 190/263 [2:25:38<59:58, 49.29s/it] {'loss': '0.6435', 'grad_norm': '0.2112', 'learning_rate': '4.437e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '39.61', 'memory/max_allocated (GiB)': '39.61', 'memory/device_reserved (GiB)': '54.75', 'tokens/train_per_sec_per_gpu': '314.9', 'tokens/trainable': 2648683, 'tokens/total': 5751500, 'epoch': '0.7248'}
	72%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 190/263 [2:25:38<59:58, 49.29s/it] 73%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 191/263 [2:26:29<59:50, 49.87s/it] {'loss': '0.6434', 'grad_norm': '0.1974', 'learning_rate': '4.328e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '48.38', 'memory/max_allocated (GiB)': '48.38', 'memory/device_reserved (GiB)': '68.25', 'tokens/train_per_sec_per_gpu': '370.3', 'tokens/trainable': 2667651, 'tokens/total': 5795452, 'epoch': '0.7287'}
	73%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 191/263 [2:26:29<59:50, 49.87s/it] 73%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 192/263 [2:27:33<1:04:07, 54.19s/it] {'loss': '0.7199', 'grad_norm': '0.1943', 'learning_rate': '4.219e-05', 'ppl': '2.054', 'memory/max_active (GiB)': '51.55', 'memory/max_allocated (GiB)': '51.55', 'memory/device_reserved (GiB)': '72.91', 'tokens/train_per_sec_per_gpu': '241.4', 'tokens/trainable': 2683164, 'tokens/total': 5831970, 'epoch': '0.7325'}
	73%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 192/263 [2:27:33<1:04:07, 54.19s/it] 73%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 193/263 [2:28:31<1:04:25, 55.22s/it] {'loss': '0.6151', 'grad_norm': '0.212', 'learning_rate': '4.111e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '45.42', 'memory/max_allocated (GiB)': '45.42', 'memory/device_reserved (GiB)': '63.73', 'tokens/train_per_sec_per_gpu': '237.2', 'tokens/trainable': 2696830, 'tokens/total': 5864046, 'epoch': '0.7363'}
	73%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 193/263 [2:28:31<1:04:25, 55.22s/it] 74%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 194/263 [2:29:15<59:32, 51.78s/it] {'loss': '0.662', 'grad_norm': '0.1948', 'learning_rate': '4.005e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '36.93', 'memory/max_allocated (GiB)': '36.93', 'memory/device_reserved (GiB)': '46.86', 'tokens/train_per_sec_per_gpu': '370.9', 'tokens/trainable': 2713056, 'tokens/total': 5897816, 'epoch': '0.7401'}
	74%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ \| 194/263 [2:29:15<59:32, 51.78s/it] 74%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 195/263 [2:29:58<55:48, 49.24s/it] {'loss': '0.7405', 'grad_norm': '0.2793', 'learning_rate': '3.899e-05', 'ppl': '2.097', 'memory/max_active (GiB)': '25.49', 'memory/max_allocated (GiB)': '25.49', 'memory/device_reserved (GiB)': '32.93', 'tokens/train_per_sec_per_gpu': '231.7', 'tokens/trainable': 2723094, 'tokens/total': 5917064, 'epoch': '0.7439'}
	74%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 195/263 [2:29:58<55:48, 49.24s/it] 75%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████ \| 196/263 [2:30:43<53:30, 47.92s/it] {'loss': '0.6287', 'grad_norm': '0.2298', 'learning_rate': '3.795e-05', 'ppl': '1.875', 'memory/max_active (GiB)': '21.3', 'memory/max_allocated (GiB)': '21.3', 'memory/device_reserved (GiB)': '26.55', 'tokens/train_per_sec_per_gpu': '184.1', 'tokens/trainable': 2731350, 'tokens/total': 5936682, 'epoch': '0.7477'}
	75%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████ \| 196/263 [2:30:43<53:30, 47.92s/it] 75%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 197/263 [2:31:37<54:37, 49.66s/it] {'loss': '0.6935', 'grad_norm': '0.2097', 'learning_rate': '3.691e-05', 'ppl': '2.001', 'memory/max_active (GiB)': '30.31', 'memory/max_allocated (GiB)': '30.31', 'memory/device_reserved (GiB)': '40.12', 'tokens/train_per_sec_per_gpu': '254.8', 'tokens/trainable': 2745037, 'tokens/total': 5968342, 'epoch': '0.7515'}
	75%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 197/263 [2:31:37<54:37, 49.66s/it] 75%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 198/263 [2:32:29<54:33, 50.36s/it] {'loss': '0.6516', 'grad_norm': '0.2135', 'learning_rate': '3.589e-05', 'ppl': '1.919', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.11', 'tokens/train_per_sec_per_gpu': '247.8', 'tokens/trainable': 2757923, 'tokens/total': 5997292, 'epoch': '0.7554'}
	75%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 198/263 [2:32:29<54:33, 50.36s/it] 76%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 199/263 [2:33:29<56:57, 53.40s/it] {'loss': '0.6918', 'grad_norm': '0.1828', 'learning_rate': '3.488e-05', 'ppl': '1.997', 'memory/max_active (GiB)': '56.91', 'memory/max_allocated (GiB)': '56.91', 'memory/device_reserved (GiB)': '73.99', 'tokens/train_per_sec_per_gpu': '302.4', 'tokens/trainable': 2776213, 'tokens/total': 6038550, 'epoch': '0.7592'}
	76%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 199/263 [2:33:29<56:57, 53.40s/it] 76%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 200/263 [2:34:22<56:02, 53.37s/it] {'loss': '0.6774', 'grad_norm': '0.2068', 'learning_rate': '3.388e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '43.54', 'memory/max_allocated (GiB)': '43.54', 'memory/device_reserved (GiB)': '60.76', 'tokens/train_per_sec_per_gpu': '271.4', 'tokens/trainable': 2790680, 'tokens/total': 6069138, 'epoch': '0.763'}
	76%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 200/263 [2:34:22<56:02, 53.37s/it] 76%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 201/263 [2:35:22<57:08, 55.30s/it] {'loss': '0.6853', 'grad_norm': '0.2284', 'learning_rate': '3.289e-05', 'ppl': '1.984', 'memory/max_active (GiB)': '55.21', 'memory/max_allocated (GiB)': '55.21', 'memory/device_reserved (GiB)': '78.48', 'tokens/train_per_sec_per_gpu': '256.8', 'tokens/trainable': 2806039, 'tokens/total': 6104430, 'epoch': '0.7668'}
	77% 202/263 [2:35:58<50:17, 49.47s/it] {'loss': '0.7109', 'grad_norm': '0.2175', 'learning_rate': '3.191e-05', 'ppl': '2.036', 'memory/max_active (GiB)': '23.78', 'memory/max_allocated (GiB)': '23.78', 'memory/device_reserved (GiB)': '30.28', 'tokens/train_per_sec_per_gpu': '336.1', 'tokens/trainable': 2818090, 'tokens/total': 6127762, 'epoch': '0.7706'}
	77% 203/263 [2:36:28<43:45, 43.75s/it] {'loss': '0.6929', 'grad_norm': '0.2504', 'learning_rate': '3.095e-05', 'ppl': '2', 'memory/max_active (GiB)': '27.24', 'memory/max_allocated (GiB)': '27.24', 'memory/device_reserved (GiB)': '30.67', 'tokens/train_per_sec_per_gpu': '347.7', 'tokens/trainable': 2828664, 'tokens/total': 6151536, 'epoch': '0.7744'}
	78% 204/263 [2:37:16<44:12, 44.96s/it] {'loss': '0.6381', 'grad_norm': '0.1846', 'learning_rate': '3e-05', 'ppl': '1.893', 'memory/max_active (GiB)': '35.73', 'memory/max_allocated (GiB)': '35.73', 'memory/device_reserved (GiB)': '45.1', 'tokens/train_per_sec_per_gpu': '335.6', 'tokens/trainable': 2844697, 'tokens/total': 6186540, 'epoch': '0.7783'}
	78% 205/263 [2:38:09<45:46, 47.35s/it] {'loss': '0.578', 'grad_norm': '0.1669', 'learning_rate': '2.906e-05', 'ppl': '1.783', 'memory/max_active (GiB)': '38.6', 'memory/max_allocated (GiB)': '38.6', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '300.8', 'tokens/trainable': 2860615, 'tokens/total': 6225342, 'epoch': '0.7821'}
	78% 206/263 [2:38:55<44:36, 46.95s/it] {'loss': '0.7123', 'grad_norm': '0.2747', 'learning_rate': '2.813e-05', 'ppl': '2.039', 'memory/max_active (GiB)': '33.82', 'memory/max_allocated (GiB)': '33.82', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '214.9', 'tokens/trainable': 2870504, 'tokens/total': 6249862, 'epoch': '0.7859'}
	79% 207/263 [2:39:38<42:35, 45.64s/it] {'loss': '0.5941', 'grad_norm': '0.198', 'learning_rate': '2.721e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '42.35', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 2883356, 'tokens/total': 6278118, 'epoch': '0.7897'}
	79% 208/263 [2:40:17<40:11, 43.85s/it] {'loss': '0.7291', 'grad_norm': '0.1958', 'learning_rate': '2.631e-05', 'ppl': '2.073', 'memory/max_active (GiB)': '29.79', 'memory/max_allocated (GiB)': '29.79', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '300.4', 'tokens/trainable': 2895280, 'tokens/total': 6304026, 'epoch': '0.7935'}
	79% 209/263 [2:40:56<38:06, 42.35s/it] {'loss': '0.6712', 'grad_norm': '0.209', 'learning_rate': '2.542e-05', 'ppl': '1.957', 'memory/max_active (GiB)': '28.91', 'memory/max_allocated (GiB)': '28.91', 'memory/device_reserved (GiB)': '37.99', 'tokens/train_per_sec_per_gpu': '305.4', 'tokens/trainable': 2907146, 'tokens/total': 6329922, 'epoch': '0.7973'}
	80% 210/263 [2:41:38<37:14, 42.17s/it] {'loss': '0.6474', 'grad_norm': '0.2265', 'learning_rate': '2.454e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '33.6', 'memory/max_allocated (GiB)': '33.6', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '257.8', 'tokens/trainable': 2917904, 'tokens/total': 6354510, 'epoch': '0.8011'}
	80% 211/263 [2:42:15<35:18, 40.74s/it] {'loss': '0.6219', 'grad_norm': '0.1866', 'learning_rate': '2.368e-05', 'ppl': '1.863', 'memory/max_active (GiB)': '23.98', 'memory/max_allocated (GiB)': '23.98', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 2929254, 'tokens/total': 6379050, 'epoch': '0.805'}
	81% 212/263 [2:43:06<37:06, 43.66s/it] {'loss': '0.5683', 'grad_norm': '0.1713', 'learning_rate': '2.283e-05', 'ppl': '1.765', 'memory/max_active (GiB)': '36.5', 'memory/max_allocated (GiB)': '36.5', 'memory/device_reserved (GiB)': '46.02', 'tokens/train_per_sec_per_gpu': '342.5', 'tokens/trainable': 2946538, 'tokens/total': 6416330, 'epoch': '0.8088'}
	81% 213/263 [2:44:05<40:19, 48.39s/it] {'loss': '0.6341', 'grad_norm': '0.1864', 'learning_rate': '2.199e-05', 'ppl': '1.885', 'memory/max_active (GiB)': '52.19', 'memory/max_allocated (GiB)': '52.19', 'memory/device_reserved (GiB)': '74.01', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 2962875, 'tokens/total': 6452688, 'epoch': '0.8126'}
	81% 214/263 [2:44:59<40:41, 49.83s/it] {'loss': '0.6799', 'grad_norm': '0.2418', 'learning_rate': '2.117e-05', 'ppl': '1.974', 'memory/max_active (GiB)': '33.52', 'memory/max_allocated (GiB)': '33.52', 'memory/device_reserved (GiB)': '42', 'tokens/train_per_sec_per_gpu': '259.6', 'tokens/trainable': 2976675, 'tokens/total': 6485638, 'epoch': '0.8164'}
	82% 215/263 [2:45:35<36:33, 45.70s/it] {'loss': '0.5868', 'grad_norm': '0.2032', 'learning_rate': '2.036e-05', 'ppl': '1.798', 'memory/max_active (GiB)': '24.76', 'memory/max_allocated (GiB)': '24.76', 'memory/device_reserved (GiB)': '31.8', 'tokens/train_per_sec_per_gpu': '283.4', 'tokens/trainable': 2986893, 'tokens/total': 6509020, 'epoch': '0.8202'}
	82% 216/263 [2:46:18<35:20, 45.12s/it] {'loss': '0.7066', 'grad_norm': '0.2272', 'learning_rate': '1.957e-05', 'ppl': '2.027', 'memory/max_active (GiB)': '34.06', 'memory/max_allocated (GiB)': '34.06', 'memory/device_reserved (GiB)': '42.78', 'tokens/train_per_sec_per_gpu': '281.2', 'tokens/trainable': 2999202, 'tokens/total': 6537048, 'epoch': '0.824'}
	83% 217/263 [2:46:58<33:17, 43.41s/it] {'loss': '0.6148', 'grad_norm': '0.1853', 'learning_rate': '1.879e-05', 'ppl': '1.849', 'memory/max_active (GiB)': '27.75', 'memory/max_allocated (GiB)': '27.75', 'memory/device_reserved (GiB)': '36.31', 'tokens/train_per_sec_per_gpu': '309.6', 'tokens/trainable': 3011411, 'tokens/total': 6563400, 'epoch': '0.8278'}
	83% 218/263 [2:47:46<33:43, 44.97s/it] {'loss': '0.6522', 'grad_norm': '0.1957', 'learning_rate': '1.802e-05', 'ppl': '1.92', 'memory/max_active (GiB)': '31.19', 'memory/max_allocated (GiB)': '31.19', 'memory/device_reserved (GiB)': '41.37', 'tokens/train_per_sec_per_gpu': '279.6', 'tokens/trainable': 3024998, 'tokens/total': 6593488, 'epoch': '0.8317'}
	83% 219/263 [2:48:33<33:24, 45.57s/it] {'loss': '0.6208', 'grad_norm': '0.1843', 'learning_rate': '1.727e-05', 'ppl': '1.86', 'memory/max_active (GiB)': '30.24', 'memory/max_allocated (GiB)': '30.24', 'memory/device_reserved (GiB)': '39.91', 'tokens/train_per_sec_per_gpu': '332.7', 'tokens/trainable': 3040622, 'tokens/total': 6624140, 'epoch': '0.8355'}
	84% 220/263 [2:49:22<33:25, 46.63s/it] {'loss': '0.6097', 'grad_norm': '0.1874', 'learning_rate': '1.653e-05', 'ppl': '1.84', 'memory/max_active (GiB)': '37.68', 'memory/max_allocated (GiB)': '37.68', 'memory/device_reserved (GiB)': '47.76', 'tokens/train_per_sec_per_gpu': '277.7', 'tokens/trainable': 3054263, 'tokens/total': 6654518, 'epoch': '0.8393'}
	84% 221/263 [2:50:14<33:34, 47.97s/it] {'loss': '0.6886', 'grad_norm': '0.2007', 'learning_rate': '1.581e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '29.37', 'memory/max_allocated (GiB)': '29.37', 'memory/device_reserved (GiB)': '38.64', 'tokens/train_per_sec_per_gpu': '249.6', 'tokens/trainable': 3067017, 'tokens/total': 6683614, 'epoch': '0.8431'}
	84% 222/263 [2:51:01<32:41, 47.83s/it] {'loss': '0.629', 'grad_norm': '0.2061', 'learning_rate': '1.51e-05', 'ppl': '1.876', 'memory/max_active (GiB)': '47.52', 'memory/max_allocated (GiB)': '47.52', 'memory/device_reserved (GiB)': '66.84', 'tokens/train_per_sec_per_gpu': '305', 'tokens/trainable': 3081510, 'tokens/total': 6718874, 'epoch': '0.8469'}
	85% 223/263 [2:51:45<31:07, 46.68s/it] {'loss': '0.6047', 'grad_norm': '0.1965', 'learning_rate': '1.441e-05', 'ppl': '1.831', 'memory/max_active (GiB)': '37.79', 'memory/max_allocated (GiB)': '37.79', 'memory/device_reserved (GiB)': '47.88', 'tokens/train_per_sec_per_gpu': '297.3', 'tokens/trainable': 3094585, 'tokens/total': 6747542, 'epoch': '0.8507'}
	85% 224/263 [2:52:53<34:29, 53.05s/it] {'loss': '0.6172', 'grad_norm': '0.1745', 'learning_rate': '1.373e-05', 'ppl': '1.854', 'memory/max_active (GiB)': '48.12', 'memory/max_allocated (GiB)': '48.12', 'memory/device_reserved (GiB)': '67.76', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 3114552, 'tokens/total': 6793276, 'epoch': '0.8546'}
	86% 225/263 [2:53:48<33:52, 53.49s/it] {'loss': '0.6193', 'grad_norm': '0.2026', 'learning_rate': '1.307e-05', 'ppl': '1.858', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.99', 'tokens/train_per_sec_per_gpu': '257.3', 'tokens/trainable': 3128578, 'tokens/total': 6823484, 'epoch': '0.8584'}
	86% 226/263 [2:54:34<31:37, 51.28s/it] {'loss': '0.6827', 'grad_norm': '0.2373', 'learning_rate': '1.242e-05', 'ppl': '1.979', 'memory/max_active (GiB)': '39.13', 'memory/max_allocated (GiB)': '39.13', 'memory/device_reserved (GiB)': '54.05', 'tokens/train_per_sec_per_gpu': '285.3', 'tokens/trainable': 3141735, 'tokens/total': 6855924, 'epoch': '0.8622'}
	86% 227/263 [2:55:15<29:03, 48.44s/it] {'loss': '0.6604', 'grad_norm': '0.1903', 'learning_rate': '1.179e-05', 'ppl': '1.935', 'memory/max_active (GiB)': '30.3', 'memory/max_allocated (GiB)': '30.3', 'memory/device_reserved (GiB)': '40.04', 'tokens/train_per_sec_per_gpu': '315.4', 'tokens/trainable': 3154923, 'tokens/total': 6884738, 'epoch': '0.866'}
	87% 228/263 [2:55:58<27:15, 46.74s/it] {'loss': '0.6514', 'grad_norm': '0.2107', 'learning_rate': '1.117e-05', 'ppl': '1.918', 'memory/max_active (GiB)': '22.9', 'memory/max_allocated (GiB)': '22.9', 'memory/device_reserved (GiB)': '34.14', 'tokens/train_per_sec_per_gpu': '252.7', 'tokens/trainable': 3165734, 'tokens/total': 6907742, 'epoch': '0.8698'}
	87% 229/263 [2:56:54<27:58, 49.37s/it] {'loss': '0.6451', 'grad_norm': '0.1984', 'learning_rate': '1.057e-05', 'ppl': '1.906', 'memory/max_active (GiB)': '43.49', 'memory/max_allocated (GiB)': '43.49', 'memory/device_reserved (GiB)': '60.71', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 3180849, 'tokens/total': 6937996, 'epoch': '0.8736'}
	87% 230/263 [2:57:41<26:45, 48.65s/it] {'loss': '0.6366', 'grad_norm': '0.1864', 'learning_rate': '9.985e-06', 'ppl': '1.89', 'memory/max_active (GiB)': '35.46', 'memory/max_allocated (GiB)': '35.46', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '330', 'tokens/trainable': 3196343, 'tokens/total': 6970136, 'epoch': '0.8774'}
	88% 231/263 [2:58:23<24:51, 46.60s/it] {'loss': '0.6353', 'grad_norm': '0.1761', 'learning_rate': '9.416e-06', 'ppl': '1.888', 'memory/max_active (GiB)': '34.63', 'memory/max_allocated (GiB)': '34.63', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '386.8', 'tokens/trainable': 3212523, 'tokens/total': 7006106, 'epoch': '0.8813'}
	88% 232/263 [2:59:15<24:55, 48.25s/it] {'loss': '0.5426', 'grad_norm': '0.1728', 'learning_rate': '8.862e-06', 'ppl': '1.721', 'memory/max_active (GiB)': '38.02', 'memory/max_allocated (GiB)': '38.02', 'memory/device_reserved (GiB)': '48.33', 'tokens/train_per_sec_per_gpu': '275.8', 'tokens/trainable': 3226889, 'tokens/total': 7039494, 'epoch': '0.8851'}
	89% 233/263 [2:59:59<23:35, 47.20s/it] {'loss': '0.6979', 'grad_norm': '0.2131', 'learning_rate': '8.325e-06', 'ppl': '2.01', 'memory/max_active (GiB)': '35.02', 'memory/max_allocated (GiB)': '35.02', 'memory/device_reserved (GiB)': '44.24', 'tokens/train_per_sec_per_gpu': '322.7', 'tokens/trainable': 3241328, 'tokens/total': 7070424, 'epoch': '0.8889'}
	89% 234/263 [3:00:41<22:00, 45.55s/it] {'loss': '0.6724', 'grad_norm': '0.2417', 'learning_rate': '7.803e-06', 'ppl': '1.959', 'memory/max_active (GiB)': '28.09', 'memory/max_allocated (GiB)': '28.09', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 3251965, 'tokens/total': 7093680, 'epoch': '0.8927'}
	89% 235/263 [3:01:27<21:14, 45.53s/it] {'loss': '0.6359', 'grad_norm': '0.1907', 'learning_rate': '7.298e-06', 'ppl': '1.889', 'memory/max_active (GiB)': '36.62', 'memory/max_allocated (GiB)': '36.62', 'memory/device_reserved (GiB)': '46.28', 'tokens/train_per_sec_per_gpu': '325.9', 'tokens/trainable': 3266781, 'tokens/total': 7123902, 'epoch': '0.8965'}
	90% 236/263 [3:02:15<20:50, 46.32s/it] {'loss': '0.6866', 'grad_norm': '0.202', 'learning_rate': '6.809e-06', 'ppl': '1.987', 'memory/max_active (GiB)': '38.62', 'memory/max_allocated (GiB)': '38.62', 'memory/device_reserved (GiB)': '53.25', 'tokens/train_per_sec_per_gpu': '298.6', 'tokens/trainable': 3281164, 'tokens/total': 7156358, 'epoch': '0.9003'}
	90% 237/263 [3:02:56<19:22, 44.71s/it] {'loss': '0.6841', 'grad_norm': '0.2104', 'learning_rate': '6.337e-06', 'ppl': '1.982', 'memory/max_active (GiB)': '36.98', 'memory/max_allocated (GiB)': '36.98', 'memory/device_reserved (GiB)': '46.64', 'tokens/train_per_sec_per_gpu': '374.6', 'tokens/trainable': 3296514, 'tokens/total': 7188274, 'epoch': '0.9041'}
	90% 238/263 [3:03:46<19:19, 46.39s/it] {'loss': '0.6678', 'grad_norm': '0.1991', 'learning_rate': '5.881e-06', 'ppl': '1.95', 'memory/max_active (GiB)': '58.32', 'memory/max_allocated (GiB)': '58.32', 'memory/device_reserved (GiB)': '75.86', 'tokens/train_per_sec_per_gpu': '327.5', 'tokens/trainable': 3312986, 'tokens/total': 7224448, 'epoch': '0.908'}
	91% 239/263 [3:04:33<18:35, 46.48s/it] {'loss': '0.6174', 'grad_norm': '0.1702', 'learning_rate': '5.441e-06', 'ppl': '1.854', 'memory/max_active (GiB)': '31.28', 'memory/max_allocated (GiB)': '31.28', 'memory/device_reserved (GiB)': '41.57', 'tokens/train_per_sec_per_gpu': '353.2', 'tokens/trainable': 3329480, 'tokens/total': 7253234, 'epoch': '0.9118'}
	91% 240/263 [3:05:26<18:37, 48.60s/it] {'loss': '0.6532', 'grad_norm': '0.1679', 'learning_rate': '5.018e-06', 'ppl': '1.922', 'memory/max_active (GiB)': '34.53', 'memory/max_allocated (GiB)': '34.53', 'memory/device_reserved (GiB)': '43.5', 'tokens/train_per_sec_per_gpu': '319.1', 'tokens/trainable': 3346561, 'tokens/total': 7292512, 'epoch': '0.9156'}
	92% 241/263 [3:06:09<17:10, 46.84s/it] {'loss': '0.6461', 'grad_norm': '0.2172', 'learning_rate': '4.612e-06', 'ppl': '1.908', 'memory/max_active (GiB)': '30.52', 'memory/max_allocated (GiB)': '30.52', 'memory/device_reserved (GiB)': '40.36', 'tokens/train_per_sec_per_gpu': '270.1', 'tokens/trainable': 3358110, 'tokens/total': 7319476, 'epoch': '0.9194'}
	92% 242/263 [3:07:13<18:14, 52.12s/it] {'loss': '0.6017', 'grad_norm': '0.1813', 'learning_rate': '4.222e-06', 'ppl': '1.825', 'memory/max_active (GiB)': '46.95', 'memory/max_allocated (GiB)': '46.95', 'memory/device_reserved (GiB)': '66.02', 'tokens/train_per_sec_per_gpu': '347.9', 'tokens/trainable': 3380518, 'tokens/total': 7369626, 'epoch': '0.9232'}
	92% 243/263 [3:08:11<17:54, 53.72s/it] {'loss': '0.5814', 'grad_norm': '0.1749', 'learning_rate': '3.85e-06', 'ppl': '1.789', 'memory/max_active (GiB)': '39.53', 'memory/max_allocated (GiB)': '39.53', 'memory/device_reserved (GiB)': '58.11', 'tokens/train_per_sec_per_gpu': '306.3', 'tokens/trainable': 3398118, 'tokens/total': 7406506, 'epoch': '0.927'}
	93% 244/263 [3:08:58<16:25, 51.87s/it] {'loss': '0.632', 'grad_norm': '0.1955', 'learning_rate': '3.494e-06', 'ppl': '1.881', 'memory/max_active (GiB)': '49.89', 'memory/max_allocated (GiB)': '49.89', 'memory/device_reserved (GiB)': '70.63', 'tokens/train_per_sec_per_gpu': '358.4', 'tokens/trainable': 3415167, 'tokens/total': 7447132, 'epoch': '0.9309'}
	93% 245/263 [3:09:42<14:46, 49.25s/it] {'loss': '0.683', 'grad_norm': '0.2159', 'learning_rate': '3.155e-06', 'ppl': '1.98', 'memory/max_active (GiB)': '31.1', 'memory/max_allocated (GiB)': '31.1', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 3428390, 'tokens/total': 7476600, 'epoch': '0.9347'}
	94% 246/263 [3:11:01<16:32, 58.35s/it] {'loss': '0.6627', 'grad_norm': '0.1617', 'learning_rate': '2.833e-06', 'ppl': '1.94', 'memory/max_active (GiB)': '57.27', 'memory/max_allocated (GiB)': '57.27', 'memory/device_reserved (GiB)': '74.5', 'tokens/train_per_sec_per_gpu': '342.7', 'tokens/trainable': 3455668, 'tokens/total': 7533052, 'epoch': '0.9385'}
	94% 247/263 [3:11:31<13:18, 49.90s/it] {'loss': '0.6176', 'grad_norm': '0.1991', 'learning_rate': '2.528e-06', 'ppl': '1.855', 'memory/max_active (GiB)': '27.2', 'memory/max_allocated (GiB)': '27.2', 'memory/device_reserved (GiB)': '35.37', 'tokens/train_per_sec_per_gpu': '385.7', 'tokens/trainable': 3467302, 'tokens/total': 7557486, 'epoch': '0.9423'}
	94% 248/263 [3:12:15<11:59, 47.94s/it] {'loss': '0.6939', 'grad_norm': '0.2216', 'learning_rate': '2.241e-06', 'ppl': '2.002', 'memory/max_active (GiB)': '44.94', 'memory/max_allocated (GiB)': '44.94', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '307.3', 'tokens/trainable': 3480626, 'tokens/total': 7588958, 'epoch': '0.9461'}
	95% 249/263 [3:13:08<11:32, 49.44s/it] {'loss': '0.6791', 'grad_norm': '0.2064', 'learning_rate': '1.97e-06', 'ppl': '1.972', 'memory/max_active (GiB)': '39.77', 'memory/max_allocated (GiB)': '39.77', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '323.8', 'tokens/trainable': 3497767, 'tokens/total': 7623894, 'epoch': '0.9499'}
	95% 250/263 [3:14:15<11:50, 54.69s/it] {'loss': '0.6591', 'grad_norm': '0.2037', 'learning_rate': '1.717e-06', 'ppl': '1.933', 'memory/max_active (GiB)': '55.75', 'memory/max_allocated (GiB)': '55.75', 'memory/device_reserved (GiB)': '72.33', 'tokens/train_per_sec_per_gpu': '270.7', 'tokens/trainable': 3515893, 'tokens/total': 7665338, 'epoch': '0.9537'}
	95% 251/263 [3:14:50<09:48, 49.05s/it] {'loss': '0.6294', 'grad_norm': '0.2187', 'learning_rate': '1.481e-06', 'ppl': '1.876', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '32.43', 'tokens/train_per_sec_per_gpu': '270.5', 'tokens/trainable': 3525599, 'tokens/total': 7687250, 'epoch': '0.9576'}
	96% 252/263 [3:15:41<09:04, 49.48s/it] {'loss': '0.545', 'grad_norm': '0.1694', 'learning_rate': '1.262e-06', 'ppl': '1.725', 'memory/max_active (GiB)': '48.1', 'memory/max_allocated (GiB)': '48.1', 'memory/device_reserved (GiB)': '67.93', 'tokens/train_per_sec_per_gpu': '384.7', 'tokens/trainable': 3545027, 'tokens/total': 7722636, 'epoch': '0.9614'}
	96% 253/263 [3:16:51<09:16, 55.62s/it] {'loss': '0.5997', 'grad_norm': '0.1826', 'learning_rate': '1.061e-06', 'ppl': '1.822', 'memory/max_active (GiB)': '50.46', 'memory/max_allocated (GiB)': '50.46', 'memory/device_reserved (GiB)': '71.27', 'tokens/train_per_sec_per_gpu': '251.7', 'tokens/trainable': 3562629, 'tokens/total': 7766154, 'epoch': '0.9652'}
	97% 254/263 [3:17:55<08:42, 58.06s/it] {'loss': '0.6639', 'grad_norm': '0.187', 'learning_rate': '8.773e-07', 'ppl': '1.942', 'memory/max_active (GiB)': '51.43', 'memory/max_allocated (GiB)': '51.43', 'memory/device_reserved (GiB)': '72.87', 'tokens/train_per_sec_per_gpu': '309.8', 'tokens/trainable': 3582377, 'tokens/total': 7807646, 'epoch': '0.969'}
	97% 255/263 [3:18:36<07:04, 53.09s/it] {'loss': '0.7493', 'grad_norm': '0.2242', 'learning_rate': '7.108e-07', 'ppl': '2.116', 'memory/max_active (GiB)': '34.97', 'memory/max_allocated (GiB)': '34.97', 'memory/device_reserved (GiB)': '44.01', 'tokens/train_per_sec_per_gpu': '305.2', 'tokens/trainable': 3595039, 'tokens/total': 7833474, 'epoch': '0.9728'}
	97% 256/263 [3:19:23<05:58, 51.28s/it] {'loss': '0.6764', 'grad_norm': '0.2359', 'learning_rate': '5.618e-07', 'ppl': '1.967', 'memory/max_active (GiB)': '21.36', 'memory/max_allocated (GiB)': '21.36', 'memory/device_reserved (GiB)': '26.67', 'tokens/train_per_sec_per_gpu': '177.1', 'tokens/trainable': 3603378, 'tokens/total': 7853228, 'epoch': '0.9766'}
	98% 257/263 [3:20:16<05:10, 51.73s/it] {'loss': '0.5852', 'grad_norm': '0.1929', 'learning_rate': '4.302e-07', 'ppl': '1.795', 'memory/max_active (GiB)': '25.94', 'memory/max_allocated (GiB)': '25.94', 'memory/device_reserved (GiB)': '33.87', 'tokens/train_per_sec_per_gpu': '263.3', 'tokens/trainable': 3617272, 'tokens/total': 7882296, 'epoch': '0.9804'}
	98%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ \| 257/263 [3:20:16<05:10, 51.73s/it] 98%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 258/263 [3:21:05<04:14, 50.88s/it] {'loss': '0.709', 'grad_norm': '0.2294', 'learning_rate': '3.161e-07', 'ppl': '2.032', 'memory/max_active (GiB)': '46.35', 'memory/max_allocated (GiB)': '46.35', 'memory/device_reserved (GiB)': '65.2', 'tokens/train_per_sec_per_gpu': '308.2', 'tokens/trainable': 3632340, 'tokens/total': 7915626, 'epoch': '0.9843'}
	98%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ \| 258/263 [3:21:05<04:14, 50.88s/it] 98%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 259/263 [3:21:34<02:57, 44.37s/it] {'loss': '0.5761', 'grad_norm': '0.2068', 'learning_rate': '2.196e-07', 'ppl': '1.779', 'memory/max_active (GiB)': '24.15', 'memory/max_allocated (GiB)': '24.15', 'memory/device_reserved (GiB)': '39.5', 'tokens/train_per_sec_per_gpu': '365.9', 'tokens/trainable': 3643011, 'tokens/total': 7937480, 'epoch': '0.9881'}
	98%\|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ \| 259/263 [3:21:34<02:57, 44.37s/it] 99%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 260/263 [3:22:14<02:09, 43.19s/it] {'loss': '0.654', 'grad_norm': '0.1978', 'learning_rate': '1.405e-07', 'ppl': '1.923', 'memory/max_active (GiB)': '29.8', 'memory/max_allocated (GiB)': '29.8', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '340.3', 'tokens/trainable': 3656769, 'tokens/total': 7966668, 'epoch': '0.9919'}
	99%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ \| 260/263 [3:22:14<02:09, 43.19s/it] 99%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 261/263 [3:23:10<01:34, 47.04s/it] {'loss': '0.6804', 'grad_norm': '0.2136', 'learning_rate': '7.906e-08', 'ppl': '1.975', 'memory/max_active (GiB)': '32.49', 'memory/max_allocated (GiB)': '32.49', 'memory/device_reserved (GiB)': '41.61', 'tokens/train_per_sec_per_gpu': '282.8', 'tokens/trainable': 3672618, 'tokens/total': 8001206, 'epoch': '0.9957'}
	99%\|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ \| 261/263 [3:23:10<01:34, 47.04s/it] 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍\| 262/263 [3:24:03<00:48, 48.74s/it] {'loss': '0.5862', 'grad_norm': '0.2206', 'learning_rate': '3.514e-08', 'ppl': '1.797', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '40.98', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3683361, 'tokens/total': 8024868, 'epoch': '0.9995'}
	100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍\| 262/263 [3:24:03<00:48, 48.74s/it] 100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 263/263 [3:24:08<00:00, 35.60s/it] {'loss': '0.6559', 'grad_norm': '0.6503', 'learning_rate': '8.786e-09', 'ppl': '1.927', 'memory/max_active (GiB)': '17.39', 'memory/max_allocated (GiB)': '17.39', 'memory/device_reserved (GiB)': '32.52', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3684370, 'tokens/total': 8026968, 'epoch': '1'}
	[2026-06-14 17:37:06,485] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B/checkpoint-263
	{'train_runtime': '1.225e+04', 'train_samples_per_second': '0.343', 'train_steps_per_second': '0.021', 'train_loss': '0.731', 'memory/max_active (GiB)': '11.01', 'memory/max_allocated (GiB)': '11.01', 'memory/device_reserved (GiB)': '20.05', 'epoch': '1'}
	100%\|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 263/263 [3:24:11<00:00, 46.58s/it]
	[2026-06-14 17:37:12,240] [INFO] [axolotl.train.save_trained_model:267] [PID:3393] Training completed! Saving trained model to ./outputs/Jacob-2-E4B.
	[2026-06-14 17:37:12,857] [INFO] [axolotl.train.save_trained_model:388] [PID:3393] Model successfully saved to ./outputs/Jacob-2-E4B
	[2026-06-14 17:37:13,576] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B
	Processing Files (3 / 3) : 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 172MB / 172MB, 109MB/s
	New Data Upload : \| \| 0.00B / 0.00B, 0.00B/s
	...b-2-E4B/training_args.bin: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 24.2kB / 24.2kB
	...adapter_model.safetensors: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 140MB / 140MB
	...acob-2-E4B/tokenizer.json: 100%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████\| 32.2MB / 32.2MB