Text Generation
PEFT
Safetensors
GGUF
Transformers
German
gemma4
image-text-to-text
axolotl
lora
conversational
8-bit precision
bitsandbytes
Instructions to use jacob-ml/Jacob-2-E4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use jacob-ml/Jacob-2-E4B with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter from this repository
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")
model = PeftModel.from_pretrained(base_model, "jacob-ml/Jacob-2-E4B")
- Transformers
How to use jacob-ml/Jacob-2-E4B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("jacob-ml/Jacob-2-E4B")
model = AutoModelForMultimodalLM.from_pretrained("jacob-ml/Jacob-2-E4B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use jacob-ml/Jacob-2-E4B with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "jacob-ml/Jacob-2-E4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jacob-ml/Jacob-2-E4B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/jacob-ml/Jacob-2-E4B
- SGLang
How to use jacob-ml/Jacob-2-E4B with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jacob-ml/Jacob-2-E4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jacob-ml/Jacob-2-E4B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jacob-ml/Jacob-2-E4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jacob-ml/Jacob-2-E4B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use jacob-ml/Jacob-2-E4B with Docker Model Runner:
docker model run hf.co/jacob-ml/Jacob-2-E4B
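The curl calls shown above for the vLLM and SGLang servers can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical, the ports (8000 for vLLM, 30000 for SGLang) are taken from the commands above, and sending the request assumes one of those servers is already running locally.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the parsed JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same request as the curl example for vLLM (port 8000).
request = build_chat_request(
    "http://localhost:8000",
    "jacob-ml/Jacob-2-E4B",
    "What is the capital of France?",
)
# Uncomment once a server is listening:
# reply = send_chat_request(request)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server, only the base URL would change to `http://localhost:30000`.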
| [2026-06-14 14:08:51,250] [DEBUG] [axolotl.utils.config.resolve_dtype:74] [PID:3393] bf16 support detected, enabling for this configuration. | |
| [2026-06-14 14:08:51,433] [WARNING] [axolotl.utils.config.normalize_config:281] [PID:3393] Gemma4 requires use_reentrant=False for gradient checkpointing in distributed training. Setting use_reentrant=False. | |
| [2026-06-14 14:08:51,433] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:3393] baseline 0.000GB () | |
| [2026-06-14 14:08:51,434] [INFO] [axolotl.cli.config.load_cfg:333] [PID:3393] config: | |
| { | |
| "activation_offloading": true, | |
| "adapter": "lora", | |
| "attn_implementation": "sdpa", | |
| "attn_needs_dtype_cast": false, | |
| "attn_supports_packing": false, | |
| "attn_uses_flash_lib": false, | |
| "axolotl_config_path": "./config.yaml", | |
| "base_model": "google/gemma-4-E4B-it", | |
| "base_model_config": "google/gemma-4-E4B-it", | |
| "batch_size": 16, | |
| "bf16": true, | |
| "capabilities": { | |
| "bf16": true, | |
| "compute_capability": "sm_80", | |
| "fp8": false, | |
| "n_gpu": 1, | |
| "n_node": 1, | |
| "tf32": true | |
| }, | |
| "chat_template": "jinja", | |
| "chat_template_jinja": "./jinja", | |
| "context_parallel_size": 1, | |
| "cut_cross_entropy": true, | |
| "dataloader_num_workers": 1, | |
| "dataloader_pin_memory": true, | |
| "dataloader_prefetch_factor": 256, | |
| "dataset_num_proc": 31, | |
| "dataset_prepared_path": "./dataset-e4b", | |
| "datasets": [ | |
| { | |
| "chat_template": "tokenizer_default", | |
| "field_messages": "messages", | |
| "field_tools": "tools", | |
| "message_property_mappings": { | |
| "content": "content", | |
| "role": "role" | |
| }, | |
| "path": "jacob-ml/Jacob-2-SSFT-filtered", | |
| "split": "train", | |
| "trust_remote_code": false, | |
| "type": "chat_template" | |
| } | |
| ], | |
| "ddp": false, | |
| "device": "cuda:0", | |
| "dion_rank_fraction": 1.0, | |
| "dion_rank_multiple_of": 1, | |
| "eaft_alpha": 1.0, | |
| "eaft_k": 20, | |
| "env_capabilities": { | |
| "torch_version": "2.10.0" | |
| }, | |
| "eval_batch_size": 2, | |
| "eval_causal_lm_metrics": [ | |
| "sacrebleu", | |
| "comet", | |
| "ter", | |
| "chrf" | |
| ], | |
| "eval_max_new_tokens": 128, | |
| "eval_table_size": 0, | |
| "experimental_skip_move_to_device": true, | |
| "fp16": false, | |
| "freeze_mm_modules": true, | |
| "generate_samples": false, | |
| "generation_do_sample": true, | |
| "generation_max_new_tokens": 50, | |
| "generation_prompt_ratio": 0.5, | |
| "generation_temperature": 0.7, | |
| "gradient_accumulation_steps": 8, | |
| "gradient_checkpointing": true, | |
| "gradient_checkpointing_kwargs": { | |
| "use_reentrant": false | |
| }, | |
| "hub_model_id": "jacob-ml/Jacob-2-E4B", | |
| "include_tkps": true, | |
| "is_multimodal": true, | |
| "layer_offloading": true, | |
| "learning_rate": 0.0002, | |
| "lisa_layers_attribute": "model.layers", | |
| "load_best_model_at_end": false, | |
| "load_in_4bit": false, | |
| "load_in_8bit": true, | |
| "local_rank": 0, | |
| "logging_steps": 1, | |
| "lora_alpha": 16, | |
| "lora_dropout": 0.0, | |
| "lora_r": 16, | |
| "lora_target_modules": "model.language_model.layers.[\\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj", | |
| "loraplus_lr_embedding": 1e-06, | |
| "lr_scheduler": "cosine", | |
| "mean_resizing_embeddings": false, | |
| "merge_method": "memory_efficient", | |
| "micro_batch_size": 2, | |
| "model_config_type": "gemma4", | |
| "model_config_type_text": "gemma4_text", | |
| "num_epochs": 1.0, | |
| "num_generation_samples": 3, | |
| "optimizer": "adamw_torch_8bit", | |
| "otel_metrics_host": "localhost", | |
| "otel_metrics_port": 8000, | |
| "output_dir": "./outputs/Jacob-2-E4B", | |
| "pad_to_sequence_len": false, | |
| "plugins": [ | |
| "axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin" | |
| ], | |
| "pretrain_multipack_attn": true, | |
| "processor_config": "google/gemma-4-E4B-it", | |
| "profiler_steps_start": 0, | |
| "qgalore_cos_threshold": 0.4, | |
| "qgalore_gamma_proj": 2, | |
| "qgalore_proj_bits": 4, | |
| "qgalore_proj_group_size": 256, | |
| "qgalore_proj_quant": true, | |
| "qgalore_proj_type": "std", | |
| "qgalore_queue_size": 5, | |
| "qgalore_rank": 256, | |
| "qgalore_scale": 0.25, | |
| "qgalore_update_proj_gap": 200, | |
| "qlora_sharded_model_loading": false, | |
| "quantize_moe_experts": false, | |
| "ray_num_workers": 1, | |
| "relora_prune_method": "magnitude", | |
| "resources_per_worker": { | |
| "GPU": 1 | |
| }, | |
| "sample_packing": false, | |
| "sample_packing_bin_size": 200, | |
| "sample_packing_group_size": 100000, | |
| "save_only_model": false, | |
| "save_safetensors": true, | |
| "sequence_len": 8192, | |
| "shuffle_before_merging_datasets": false, | |
| "shuffle_merged_datasets": true, | |
| "skip_prepare_dataset": false, | |
| "streaming_multipack_buffer_size": 10000, | |
| "strict": false, | |
| "tensor_parallel_size": 1, | |
| "tf32": false, | |
| "tiled_mlp_use_original_mlp": true, | |
| "tokenizer_config": "google/gemma-4-E4B-it", | |
| "tokenizer_save_jinja_files": true, | |
| "torch_dtype": "torch.bfloat16", | |
| "train_on_inputs": false, | |
| "trl": { | |
| "async_prefetch": false, | |
| "log_completions": false, | |
| "mask_truncated_completions": false, | |
| "ref_model_mixup_alpha": 0.9, | |
| "ref_model_sync_steps": 64, | |
| "replay_buffer_size": 0, | |
| "replay_recompute_logps": true, | |
| "reroll_max_groups": 1, | |
| "reroll_start_fraction": 1.0, | |
| "reward_num_workers": 1, | |
| "scale_rewards": true, | |
| "skip_zero_advantage_batches": true, | |
| "sync_ref_model": false, | |
| "use_data_producer": false, | |
| "use_vllm": false, | |
| "vllm_lora_sync": false, | |
| "vllm_server_host": "0.0.0.0", | |
| "vllm_server_port": 8000 | |
| }, | |
| "use_otel_metrics": false, | |
| "use_ray": false, | |
| "val_set_size": 0.0, | |
| "vllm": { | |
| "device": "auto", | |
| "dtype": "auto", | |
| "gpu_memory_utilization": 0.9, | |
| "host": "0.0.0.0", | |
| "port": 8000 | |
| }, | |
| "warmup_ratio": 0.1, | |
| "weight_decay": 0.0, | |
| "world_size": 1 | |
| } | |
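The `lora_target_modules` value in the config above is a regular expression rather than a list of names. The sketch below shows which module names such a pattern would select, assuming the pattern is applied as a full match against each module's dotted name (this matching rule is an assumption about how the adapter library treats a string pattern). The module names in `candidates` are hypothetical and chosen only for illustration.

```python
import re

# Pattern copied from "lora_target_modules" in the logged config,
# with the JSON escaping ("\\d") reduced to a plain regex escape ("\d").
LORA_TARGET_PATTERN = (
    r"model.language_model.layers.[\d]+."
    r"(_checkpoint_wrapped_module.)?(mlp|self_attn)."
    r"(up|down|gate|q|k|v|o)_proj"
)


def is_lora_target(module_name):
    """Return True when the whole module name matches the pattern."""
    return re.fullmatch(LORA_TARGET_PATTERN, module_name) is not None


# Hypothetical module names, for illustration only.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.12.mlp.down_proj",
    "model.language_model.layers.3._checkpoint_wrapped_module.mlp.gate_proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_norm",
]
for name in candidates:
    print(f"{name}: {is_lora_target(name)}")
```

Under that assumption, only projection layers inside the language model's attention and MLP blocks are selected, which is in line with `freeze_mm_modules` being true in the same config.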
| [2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos> | |
| [2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos> | |
| [2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad> | |
| [2026-06-14 14:08:55,226] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk> | |
| [2026-06-14 14:08:55,227] [INFO] [axolotl.utils.data.shared.load_preprocessed_dataset:482] [PID:3393] Unable to find prepared dataset in dataset-e4b/226f5539ba5a2355ba6a34bd68b2a326 | |
| [2026-06-14 14:08:55,228] [INFO] [axolotl.utils.data.sft._load_raw_datasets:320] [PID:3393] Loading raw datasets... | |
| [2026-06-14 14:08:55,228] [WARNING] [axolotl.utils.data.sft._load_raw_datasets:322] [PID:3393] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset using `axolotl preprocess path/to/config.yml`. | |
| Downloading (incomplete total...): 0.00B [00:00, ?B/s] | |
| Fetching 0 files: 0it [00:00, ?it/s] | |
| Download complete: 0.00B [00:00, ?B/s] | |
| [2026-06-14 14:08:56,312] [INFO] [axolotl.utils.data.wrappers.get_dataset_wrapper:87] [PID:3393] Loading dataset: jacob-ml/Jacob-2-SSFT-filtered with base_type: chat_template and prompt_style: None | |
| [2026-06-14 14:08:56,315] [INFO] [axolotl.prompt_strategies.chat_template.__call__:1209] [PID:3393] Using chat template: | |
| --- | |
| {%- macro format_parameters(properties, required, filter_keys=false) -%} | |
| {%- set standard_keys = ['description', 'type', 'properties', 'required', 'nullable'] -%} | |
| {%- set ns = namespace(found_first=false) -%} | |
| {%- for key, value in properties | dictsort -%} | |
| {%- set add_comma = false -%} | |
| {%- if not filter_keys or key not in standard_keys -%} | |
| {%- if ns.found_first %},{% endif -%} | |
| {%- set ns.found_first = true -%} | |
| {{ key }}:{ | |
| {%- if value['description'] -%} | |
| description:<|"|>{{ value['description'] }}<|"|> | |
| {%- set add_comma = true -%} | |
| {%- endif -%} | |
| {%- if value['type'] | upper == 'STRING' -%} | |
| {%- if value['enum'] -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| enum:{{ format_argument(value['enum']) }} | |
| {%- endif -%} | |
| {%- elif value['type'] | upper == 'ARRAY' -%} | |
| {%- if value['items'] is mapping and value['items'] -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| items:{ | |
| {%- set ns_items = namespace(found_first=false) -%} | |
| {%- for item_key, item_value in value['items'] | dictsort -%} | |
| {%- if item_value is not none -%} | |
| {%- if ns_items.found_first %},{% endif -%} | |
| {%- set ns_items.found_first = true -%} | |
| {%- if item_key == 'properties' -%} | |
| properties:{ | |
| {%- if item_value is mapping -%} | |
| {{- format_parameters(item_value, value['items']['required'] | default([])) -}} | |
| {%- endif -%} | |
| } | |
| {%- elif item_key == 'required' -%} | |
| required:[ | |
| {%- for req_item in item_value -%} | |
| <|"|>{{- req_item -}}<|"|> | |
| {%- if not loop.last %},{% endif -%} | |
| {%- endfor -%} | |
| ] | |
| {%- elif item_key == 'type' -%} | |
| {%- if item_value is string -%} | |
| type:{{ format_argument(item_value | upper) }} | |
| {%- else -%} | |
| type:{{ format_argument(item_value | map('upper') | list) }} | |
| {%- endif -%} | |
| {%- else -%} | |
| {{ item_key }}:{{ format_argument(item_value) }} | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| } | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- if value['nullable'] %} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| nullable:true | |
| {%- endif -%} | |
| {%- if value['type'] | upper == 'OBJECT' -%} | |
| {%- if value['properties'] is defined and value['properties'] is mapping -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| properties:{ | |
| {{- format_parameters(value['properties'], value['required'] | default([])) -}} | |
| } | |
| {%- elif value is mapping -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| properties:{ | |
| {{- format_parameters(value, value['required'] | default([]), filter_keys=true) -}} | |
| } | |
| {%- endif -%} | |
| {%- if value['required'] -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| required:[ | |
| {%- for item in value['required'] | default([]) -%} | |
| <|"|>{{- item -}}<|"|> | |
| {%- if not loop.last %},{% endif -%} | |
| {%- endfor -%} | |
| ] | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- if add_comma %},{%- else -%} {%- set add_comma = true -%} {% endif -%} | |
| type:<|"|>{{ value['type'] | upper }}<|"|>} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- endmacro -%} | |
| {%- macro format_function_declaration(tool_data) -%} | |
| declaration:{{- tool_data['function']['name'] -}}{description:<|"|>{{- tool_data['function']['description'] -}}<|"|> | |
| {%- set params = tool_data['function']['parameters'] -%} | |
| {%- if params -%} | |
| ,parameters:{ | |
| {%- if params['properties'] -%} | |
| properties:{ {{- format_parameters(params['properties'], params['required']) -}} }, | |
| {%- endif -%} | |
| {%- if params['required'] -%} | |
| required:[ | |
| {%- for item in params['required'] -%} | |
| <|"|>{{- item -}}<|"|> | |
| {{- ',' if not loop.last -}} | |
| {%- endfor -%} | |
| ], | |
| {%- endif -%} | |
| {%- if params['type'] -%} | |
| type:<|"|>{{- params['type'] | upper -}}<|"|>} | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- if 'response' in tool_data['function'] -%} | |
| {%- set response_declaration = tool_data['function']['response'] -%} | |
| ,response:{ | |
| {%- if response_declaration['description'] -%} | |
| description:<|"|>{{- response_declaration['description'] -}}<|"|>, | |
| {%- endif -%} | |
| {%- if response_declaration['type'] | upper == 'OBJECT' -%} | |
| type:<|"|>{{- response_declaration['type'] | upper -}}<|"|>} | |
| {%- endif -%} | |
| {%- endif -%} | |
| } | |
| {%- endmacro -%} | |
| {%- macro format_argument(argument, escape_keys=True) -%} | |
| {%- if argument is string -%} | |
| {{- '<|"|>' + argument + '<|"|>' -}} | |
| {%- elif argument is boolean -%} | |
| {{- 'true' if argument else 'false' -}} | |
| {%- elif argument is mapping -%} | |
| {{- '{' -}} | |
| {%- set ns = namespace(found_first=false) -%} | |
| {%- for key, value in argument | dictsort -%} | |
| {%- if ns.found_first %},{% endif -%} | |
| {%- set ns.found_first = true -%} | |
| {%- if escape_keys -%} | |
| {{- '<|"|>' + key + '<|"|>' -}} | |
| {%- else -%} | |
| {{- key -}} | |
| {%- endif -%} | |
| :{{- format_argument(value, escape_keys=escape_keys) -}} | |
| {%- endfor -%} | |
| {{- '}' -}} | |
| {%- elif argument is sequence -%} | |
| {{- '[' -}} | |
| {%- for item in argument -%} | |
| {{- format_argument(item, escape_keys=escape_keys) -}} | |
| {%- if not loop.last %},{% endif -%} | |
| {%- endfor -%} | |
| {{- ']' -}} | |
| {%- else -%} | |
| {{- argument -}} | |
| {%- endif -%} | |
| {%- endmacro -%} | |
| {%- macro strip_thinking(text) -%} | |
| {%- set ns = namespace(result='') -%} | |
| {%- for part in text.split('<channel|>') -%} | |
| {%- if '<|channel>' in part -%} | |
| {%- set ns.result = ns.result + part.split('<|channel>')[0] -%} | |
| {%- else -%} | |
| {%- set ns.result = ns.result + part -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {{- ns.result | trim -}} | |
| {%- endmacro -%} | |
| {%- macro format_tool_response_block(tool_name, response) -%} | |
| {{- '<|tool_response>' -}} | |
| {%- if response is mapping -%} | |
| {{- 'response:' + tool_name + '{' -}} | |
| {%- for key, value in response | dictsort -%} | |
| {{- key -}}:{{- format_argument(value, escape_keys=False) -}} | |
| {%- if not loop.last %},{% endif -%} | |
| {%- endfor -%} | |
| {{- '}' -}} | |
| {%- else -%} | |
| {{- 'response:' + tool_name + '{value:' + format_argument(response, escape_keys=False) + '}' -}} | |
| {%- endif -%} | |
| {{- '<tool_response|>' -}} | |
| {%- endmacro -%} | |
| {%- set ns = namespace(prev_message_type=None) -%} | |
| {%- set loop_messages = messages -%} | |
| {{- bos_token -}} | |
| {#- Handle System/Tool Definitions Block -#} | |
| {%- if (enable_thinking is defined and enable_thinking) or tools or messages[0]['role'] in ['system', 'developer'] -%} | |
| {{- '<|turn>system\n' -}} | |
| {#- Inject Thinking token at the very top of the FIRST system turn -#} | |
| {%- if enable_thinking is defined and enable_thinking -%} | |
| {{- '<|think|>\n' -}} | |
| {%- set ns.prev_message_type = 'think' -%} | |
| {%- endif -%} | |
| {%- if messages[0]['role'] in ['system', 'developer'] -%} | |
| {%- if messages[0]['content'] is string -%} | |
| {{- messages[0]['content'] | trim -}} | |
| {%- elif messages[0]['content'] is sequence -%} | |
| {%- for item in messages[0]['content'] -%} | |
| {{- item['text'] | trim + ' '-}} | |
| {%- endfor -%} | |
| {%- endif -%} | |
| {%- set loop_messages = messages[1:] -%} | |
| {%- endif -%} | |
| {%- if tools -%} | |
| {%- for tool in tools %} | |
| {{- '<|tool>' -}} | |
| {{- format_function_declaration(tool) | trim -}} | |
| {{- '<tool|>' -}} | |
| {%- endfor %} | |
| {%- set ns.prev_message_type = 'tool' -%} | |
| {%- endif -%} | |
| {{- '<turn|>\n' -}} | |
| {%- endif %} | |
| {#- Pre-scan: find last user message index for reasoning guard -#} | |
| {%- set ns_turn = namespace(last_user_idx=-1) -%} | |
| {%- for i in range(loop_messages | length) -%} | |
| {%- if loop_messages[i]['role'] == 'user' -%} | |
| {%- set ns_turn.last_user_idx = i -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {#- Loop through messages -#} | |
| {%- for message in loop_messages -%} | |
| {%- if message['role'] != 'tool' -%} | |
| {%- set ns.prev_message_type = None -%} | |
| {%- set role = 'model' if message['role'] == 'assistant' else message['role'] -%} | |
| {%- if message['role'] == 'tool' and 'name' in message -%} | |
| {%- set _tool_name = message['name'] -%} | |
| {%- endif -%} | |
| {#- Detect continuation: suppress duplicate <|turn>model when previous non-tool message was also assistant -#} | |
| {%- set prev_nt = namespace(role=None, found=false) -%} | |
| {%- if loop.index0 > 0 -%} | |
| {%- for j in range(loop.index0 - 1, -1, -1) -%} | |
| {%- if not prev_nt.found -%} | |
| {%- if loop_messages[j]['role'] != 'tool' -%} | |
| {%- set prev_nt.role = loop_messages[j]['role'] -%} | |
| {%- set prev_nt.found = true -%} | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- endif -%} | |
| {%- set continue_same_model_turn = (role == 'model' and prev_nt.role == 'assistant') -%} | |
| {%- if not continue_same_model_turn -%} | |
| {{- '<|turn>' + role + '\n' }} | |
| {%- endif -%} | |
| {#- Render reasoning/reasoning_content as thinking channel -#} | |
| {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%} | |
| {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%} | |
| {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}} | |
| {%- endif -%} | |
| {%- if message['tool_calls'] -%} | |
| {%- for tool_call in message['tool_calls'] -%} | |
| {%- set function = tool_call['function'] -%} | |
| {{- '<|tool_call>call:' + function['name'] + '{' -}} | |
| {%- if function['arguments'] is mapping -%} | |
| {%- set ns_args = namespace(found_first=false) -%} | |
| {%- for key, value in function['arguments'] | dictsort -%} | |
| {%- if ns_args.found_first %},{% endif -%} | |
| {%- set ns_args.found_first = true -%} | |
| {{- key -}}:{{- format_argument(value, escape_keys=False) -}} | |
| {%- endfor -%} | |
| {%- elif function['arguments'] is string -%} | |
| {{- function['arguments'] -}} | |
| {%- endif -%} | |
| {{- '}<tool_call|>' -}} | |
| {%- endfor -%} | |
| {%- set ns.prev_message_type = 'tool_call' -%} | |
| {%- endif -%} | |
| {%- set ns_tr_out = namespace(flag=false) -%} | |
| {%- if message.get('tool_responses') -%} | |
| {#- Legacy: tool_responses embedded on the assistant message (Google/Gemma native) -#} | |
| {%- for tool_response in message['tool_responses'] -%} | |
| {{- format_tool_response_block(tool_response['name'] | default('unknown', true), tool_response['response']) -}} | |
| {%- set ns_tr_out.flag = true -%} | |
| {%- set ns.prev_message_type = 'tool_response' -%} | |
| {%- endfor -%} | |
| {%- elif message.get('tool_calls') -%} | |
| {#- OpenAI Chat Completions: forward-scan consecutive role:tool messages -#} | |
| {%- set ns_tool_scan = namespace(stopped=false) -%} | |
| {%- for k in range(loop.index0 + 1, loop_messages | length) -%} | |
| {%- if ns_tool_scan.stopped -%} | |
| {%- elif loop_messages[k]['role'] != 'tool' -%} | |
| {%- set ns_tool_scan.stopped = true -%} | |
| {%- else -%} | |
| {%- set follow = loop_messages[k] -%} | |
| {#- Resolve tool_call_id to function name -#} | |
| {%- set ns_tname = namespace(name=follow['name'] | default('unknown', true)) -%} | |
| {%- for tc in message['tool_calls'] -%} | |
| {%- if tc.get('id') == follow.get('tool_call_id') -%} | |
| {%- set ns_tname.name = tc['function']['name'] -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {#- Handle content as string or content-parts array -#} | |
| {%- set tool_body = follow.get('content') -%} | |
| {%- if tool_body is string -%} | |
| {{- format_tool_response_block(ns_tname.name, tool_body) -}} | |
| {%- elif tool_body is sequence and tool_body is not string -%} | |
| {%- set ns_txt = namespace(s='') -%} | |
| {%- for part in tool_body -%} | |
| {%- if part.get('type') == 'text' -%} | |
| {%- set ns_txt.s = ns_txt.s + (part.get('text') | default('')) -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {{- format_tool_response_block(ns_tname.name, ns_txt.s) -}} | |
| {%- for part in tool_body -%} | |
| {%- if part.get('type') == 'image' -%} | |
| {{- '<|image|>' -}} | |
| {%- elif part.get('type') == 'audio' -%} | |
| {{- '<|audio|>' -}} | |
| {%- elif part.get('type') == 'video' -%} | |
| {{- '<|video|>' -}} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- else -%} | |
| {{- format_tool_response_block(ns_tname.name, tool_body) -}} | |
| {%- endif -%} | |
| {%- set ns_tr_out.flag = true -%} | |
| {%- set ns.prev_message_type = 'tool_response' -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- endif -%} | |
| {%- set captured_content -%} | |
| {%- if message['content'] is string -%} | |
| {%- if role == 'model' -%} | |
| {{- strip_thinking(message['content']) -}} | |
| {%- else -%} | |
| {{- message['content'] | trim -}} | |
| {%- endif -%} | |
| {%- elif message['content'] is sequence -%} | |
| {%- for item in message['content'] -%} | |
| {%- if item['type'] == 'text' -%} | |
| {%- if role == 'model' -%} | |
| {{- strip_thinking(item['text']) -}} | |
| {%- else -%} | |
| {{- item['text'] | trim -}} | |
| {%- endif -%} | |
| {%- elif item['type'] == 'image' -%} | |
| {{- '<|image|>' -}} | |
| {%- set ns.prev_message_type = 'image' -%} | |
| {%- elif item['type'] == 'audio' -%} | |
| {{- '<|audio|>' -}} | |
| {%- set ns.prev_message_type = 'audio' -%} | |
| {%- elif item['type'] == 'video' -%} | |
| {{- '<|video|>' -}} | |
| {%- set ns.prev_message_type = 'video' -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- endif -%} | |
| {%- endset -%} | |
| {{- captured_content -}} | |
| {%- set has_content = captured_content | trim | length > 0 -%} | |
| {%- if ns.prev_message_type == 'tool_call' and not ns_tr_out.flag -%} | |
| {{- '<|tool_response>' -}} | |
| {%- elif not (ns_tr_out.flag and not has_content) -%} | |
| {{- '<turn|>\n' -}} | |
| {%- endif -%} | |
| {%- endif -%} | |
| {%- endfor -%} | |
| {%- if add_generation_prompt -%} | |
| {%- if ns.prev_message_type != 'tool_response' and ns.prev_message_type != 'tool_call' -%} | |
| {{- '<|turn>model\n' -}} | |
| {%- endif -%} | |
| {%- endif -%} | |
| --- | |
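The `strip_thinking` macro in the template above removes thought-channel spans from model turns before they are rendered. The following Python function is a line-by-line transliteration of that macro, offered as a sketch to make its behaviour easier to follow. It is not part of the template or of any library.

```python
def strip_thinking(text):
    """Drop everything between '<|channel>' and '<channel|>' markers, then trim."""
    result = ""
    # Each chunk ends where a thought channel was closed.
    for part in text.split("<channel|>"):
        if "<|channel>" in part:
            # Keep only the text that precedes the opening marker.
            result += part.split("<|channel>")[0]
        else:
            result += part
    return result.strip()


# Hypothetical model output with one thought span before the answer.
raw = "<|channel>thought\nThe user greets me.\n<channel|>Hello!"
print(strip_thinking(raw))
```

Note that an opening marker without a matching close discards the rest of the text, since nothing after `<|channel>` in that chunk is kept.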
| [2026-06-14 14:08:56,434] [WARNING] [axolotl.prompt_strategies.chat_template._validate_eot_and_eos_tokens:357] [PID:3393] EOS token '<eos>' not found in chat_template. Please check if your template/EOS token is correct. | |
| Tokenizing Prompts (num_proc=31): 100% 4209/4209 [02:19<00:00, 30.16 examples/s] | |
| [2026-06-14 14:11:56,198] [INFO] [axolotl.utils.data.utils._log_dataset_stats:212] [PID:3393] min_input_len: 320 | |
| [2026-06-14 14:11:56,199] [INFO] [axolotl.utils.data.utils._log_dataset_stats:213] [PID:3393] max_input_len: 23372 | |
| Dropping Invalid Sequences (<None or >8192) (num_proc=31): 100% 4209/4209 [00:02<00:00, 1595.10 examples/s] | |
| [2026-06-14 14:11:58,919] [INFO] [axolotl.utils.data.utils._drop_outside_range:306] [PID:3393] Dropped 15 sequences outside valid range ([None, 8192]) | |
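The length filter reported above can be pictured as a simple range check on tokenized sequence lengths. This is a minimal sketch, not axolotl's implementation: the function name `drop_outside_range` is hypothetical, and keeping sequences of exactly 8192 tokens is inferred from the ">8192" wording in the log. The sample lengths reuse the logged minimum (320) and maximum (23372); the other values are made up.

```python
def drop_outside_range(token_lengths, max_len, min_len=None):
    """Keep lengths within [min_len, max_len]; return the kept list and the drop count."""
    kept = [
        n
        for n in token_lengths
        if n <= max_len and (min_len is None or n >= min_len)
    ]
    return kept, len(token_lengths) - len(kept)


# 320 and 23372 come from the log; the remaining lengths are illustrative.
lengths = [320, 4096, 8192, 8193, 23372]
kept, dropped = drop_outside_range(lengths, max_len=8192)
print(kept, dropped)

# Consistency check on the logged figures: 4209 examples minus 15 dropped.
remaining_examples = 4209 - 15
print(remaining_examples)
```

The remaining count matches the 4194 examples that are saved in the next step of the log.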
| Saving the dataset (16/16 shards): 100% 4194/4194 [00:15<00:00, 270.70 examples/s] | |
| [2026-06-14 14:12:15,030] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:420] [PID:3393] total_num_tokens: 6_012_949 | |
| [2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:438] [PID:3393] `total_supervised_tokens: 3_764_690` | |
| [2026-06-14 14:12:15,143] [DEBUG] [axolotl.utils.trainer.calculate_total_num_steps:521] [PID:3393] total_num_steps: 263 | |
| [2026-06-14 14:12:15,143] [INFO] [axolotl.utils.data.sft._prepare_standard_dataset:121] [PID:3393] Maximum number of steps set at 263 | |
| [2026-06-14 14:12:15,405] [DEBUG] [axolotl.train.setup_model_and_tokenizer:70] [PID:3393] loading tokenizer... google/gemma-4-E4B-it | |
| [2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:311] [PID:3393] EOS: 1 / <eos> | |
| [2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:312] [PID:3393] BOS: 2 / <bos> | |
| [2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:313] [PID:3393] PAD: 0 / <pad> | |
| [2026-06-14 14:12:19,460] [DEBUG] [axolotl.loaders.tokenizer.load_tokenizer:314] [PID:3393] UNK: 3 / <unk> | |
| [2026-06-14 14:12:24,886] [DEBUG] [axolotl.train.setup_model_and_tokenizer:81] [PID:3393] Loading model | |
| [2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:75] [PID:3393] Patched OptimState8bit for torch.compile compatibility | |
| [2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:122] [PID:3393] Patched OptimState4bit for torch.compile compatibility | |
| [2026-06-14 14:12:24,930] [DEBUG] [axolotl.monkeypatch.torchao_optim.patch_torchao_optim_state_8bit:154] [PID:3393] Patched OptimStateFp8 for torch.compile compatibility | |
| [2026-06-14 14:12:24,936] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_evaluation_loop:94] [PID:3393] Patched Trainer.evaluation_loop with nanmean loss calculation | |
| [2026-06-14 14:12:24,937] [DEBUG] [axolotl.monkeypatch.transformers.trainer_loss_calc.patch_maybe_log_save_evaluate:148] [PID:3393] Patched Trainer._maybe_log_save_evaluate with nanmean loss calculation | |
| [2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:207] [PID:3393] Patched Gemma4TextAttention.forward with fused RMSNorm+RoPE Triton kernels | |
| [2026-06-14 14:12:25,040] [INFO] [axolotl.monkeypatch.models.gemma4.fused_attn.patch_gemma4_fused_attn:211] [PID:3393] Installed Gemma4 shared_kv_states side channel (PR #3611) | |
| [2026-06-14 14:12:25,062] [INFO] [axolotl.integrations.cut_cross_entropy.pre_model_load:94] [PID:3393] Applying Cut Cross Entropy to model type: gemma4 | |
| Loading weights: 100% 2076/2076 [00:25<00:00, 81.97it/s] | |
| [2026-06-14 14:12:52,761] [INFO] [axolotl.loaders.model._prepare_model_for_quantization:977] [PID:3393] converting PEFT model w/ prepare_model_for_kbit_training | |
| [2026-06-14 14:12:52,780] [INFO] [axolotl.loaders.model._configure_embedding_dtypes:433] [PID:3393] Converting modules to torch.bfloat16 | |
| [2026-06-14 14:12:52,965] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] Memory usage after model load 22.419GB (+22.419GB allocated, +28.939GB reserved) | |
| trainable params: 34,881,536 || all params: 7,975,982,368 || trainable%: 0.4373 | |
| [2026-06-14 14:12:53,476] [DEBUG] [axolotl.loaders.model.log_gpu_memory_usage:127] [PID:3393] after adapters 10.827GB (+10.827GB allocated, +29.068GB reserved) | |
| [2026-06-14 14:12:55,145] [INFO] [axolotl.utils.freeze.freeze_mm_modules:49] [PID:3393] freeze_mm_modules: froze 0 vision/audio parameters | |
| [2026-06-14 14:12:56,196] [INFO] [axolotl.core.trainers.mixins.layer_offloading.__init__:291] [PID:3393] Layer parameter offloading enabled | |
| [2026-06-14 14:12:56,197] [WARNING] [axolotl.core.trainers.mixins.layer_offloading.__init__:73] [PID:3393] LayerOffloadManager: no decoder layers found, offloading disabled | |
| [2026-06-14 14:12:56,197] [INFO] [axolotl.train.save_initial_configs:450] [PID:3393] Pre-saving adapter config to ./outputs/Jacob-2-E4B... | |
| [2026-06-14 14:12:56,198] [INFO] [axolotl.train.save_initial_configs:454] [PID:3393] Pre-saving tokenizer to ./outputs/Jacob-2-E4B... | |
| [2026-06-14 14:12:56,696] [INFO] [axolotl.train.save_initial_configs:459] [PID:3393] Pre-saving model config to ./outputs/Jacob-2-E4B... | |
| [2026-06-14 14:12:56,700] [INFO] [axolotl.train.save_initial_configs:463] [PID:3393] Pre-saving processor to ./outputs/Jacob-2-E4B... | |
| [2026-06-14 14:12:57,113] [INFO] [axolotl.train.execute_training:226] [PID:3393] Starting trainer... | |
| 0% 0/263 [00:00<?, ?it/s] | |
| [2026-06-14 14:13:00,854] [WARNING] [py.warnings._showwarnmsg:112] [PID:3393] /workspace/axolotl-venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants. | |
| return fn(*args, **kwargs) | |
| [2026-06-14 14:14:09,129] [INFO] [axolotl.kernels.autotune_telemetry.on_step_end:133] [PID:3393] Reported 2 fused-rope kernel autotune config(s) to telemetry. | |
| 0% 1/263 [01:11<5:11:11, 71.27s/it] {'loss': '1.576', 'grad_norm': '0.5319', 'learning_rate': '0', 'ppl': '4.836', 'memory/max_active (GiB)': '35.37', 'memory/max_allocated (GiB)': '35.37', 'memory/device_reserved (GiB)': '44.53', 'tokens/trainable': 13646, 'tokens/total': 34320, 'epoch': '0.003815'} | |
| 1% 2/263 [02:07<4:31:48, 62.49s/it] {'loss': '1.541', 'grad_norm': '0.4915', 'learning_rate': '7.692e-06', 'ppl': '4.671', 'memory/max_active (GiB)': '37.15', 'memory/max_allocated (GiB)': '37.15', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '272.8', 'tokens/trainable': 29007, 'tokens/total': 65680, 'epoch': '0.00763'} | |
| 1% 3/263 [02:52<3:55:45, 54.40s/it] {'loss': '1.676', 'grad_norm': '0.5334', 'learning_rate': '1.538e-05', 'ppl': '5.345', 'memory/max_active (GiB)': '26.41', 'memory/max_allocated (GiB)': '26.41', 'memory/device_reserved (GiB)': '34.21', 'tokens/train_per_sec_per_gpu': '241.3', 'tokens/trainable': 39814, 'tokens/total': 89258, 'epoch': '0.01144'} | |
| 2% 4/263 [03:26<3:20:53, 46.54s/it] {'loss': '1.701', 'grad_norm': '0.6123', 'learning_rate': '2.308e-05', 'ppl': '5.478', 'memory/max_active (GiB)': '22.96', 'memory/max_allocated (GiB)': '22.96', 'memory/device_reserved (GiB)': '25.77', 'tokens/train_per_sec_per_gpu': '256.6', 'tokens/trainable': 48660, 'tokens/total': 108024, 'epoch': '0.01526'} | |
| 2% 5/263 [04:08<3:12:35, 44.79s/it] {'loss': '1.721', 'grad_norm': '0.6678', 'learning_rate': '3.077e-05', 'ppl': '5.59', 'memory/max_active (GiB)': '24.42', 'memory/max_allocated (GiB)': '24.42', 'memory/device_reserved (GiB)': '31.26', 'tokens/train_per_sec_per_gpu': '200.1', 'tokens/trainable': 57004, 'tokens/total': 127360, 'epoch': '0.01907'} | |
| 2% 6/263 [05:00<3:22:41, 47.32s/it] {'loss': '1.441', 'grad_norm': '0.5136', 'learning_rate': '3.846e-05', 'ppl': '4.225', 'memory/max_active (GiB)': '43.28', 'memory/max_allocated (GiB)': '43.28', 'memory/device_reserved (GiB)': '60.49', 'tokens/train_per_sec_per_gpu': '340.5', 'tokens/trainable': 74786, 'tokens/total': 165128, 'epoch': '0.02289'} | |
| 3% 7/263 [05:43<3:15:55, 45.92s/it] {'loss': '1.527', 'grad_norm': '0.5568', 'learning_rate': '4.615e-05', 'ppl': '4.603', 'memory/max_active (GiB)': '38.81', 'memory/max_allocated (GiB)': '38.81', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '290.1', 'tokens/trainable': 87272, 'tokens/total': 194790, 'epoch': '0.0267'} | |
| 3% 8/263 [06:32<3:19:30, 46.94s/it] {'loss': '1.629', 'grad_norm': '0.7715', 'learning_rate': '5.385e-05', 'ppl': '5.098', 'memory/max_active (GiB)': '34.9', 'memory/max_allocated (GiB)': '34.9', 'memory/device_reserved (GiB)': '53.62', 'tokens/train_per_sec_per_gpu': '247.6', 'tokens/trainable': 99438, 'tokens/total': 223594, 'epoch': '0.03052'} | |
| 3% 9/263 [07:20<3:19:53, 47.22s/it] {'loss': '1.545', 'grad_norm': '0.7755', 'learning_rate': '6.154e-05', 'ppl': '4.688', 'memory/max_active (GiB)': '42.92', 'memory/max_allocated (GiB)': '42.92', 'memory/device_reserved (GiB)': '59.79', 'tokens/train_per_sec_per_gpu': '298.1', 'tokens/trainable': 113695, 'tokens/total': 256322, 'epoch': '0.03433'} | |
| 4% 10/263 [08:10<3:21:47, 47.86s/it] {'loss': '1.348', 'grad_norm': '0.6723', 'learning_rate': '6.923e-05', 'ppl': '3.848', 'memory/max_active (GiB)': '28.92', 'memory/max_allocated (GiB)': '28.92', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '238.3', 'tokens/trainable': 125442, 'tokens/total': 282552, 'epoch': '0.03815'} | |
| 4% 11/263 [08:50<3:10:58, 45.47s/it] {'loss': '1.337', 'grad_norm': '0.701', 'learning_rate': '7.692e-05', 'ppl': '3.808', 'memory/max_active (GiB)': '25.92', 'memory/max_allocated (GiB)': '25.92', 'memory/device_reserved (GiB)': '33.79', 'tokens/train_per_sec_per_gpu': '275.6', 'tokens/trainable': 136483, 'tokens/total': 306800, 'epoch': '0.04196'} | |
| 5% 12/263 [09:22<2:53:32, 41.48s/it] {'loss': '1.188', 'grad_norm': '0.5657', 'learning_rate': '8.462e-05', 'ppl': '3.28', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.74', 'tokens/train_per_sec_per_gpu': '364.8', 'tokens/trainable': 148289, 'tokens/total': 329988, 'epoch': '0.04578'} | |
| 5% 13/263 [10:18<3:11:41, 46.01s/it] {'loss': '1.115', 'grad_norm': '0.5181', 'learning_rate': '9.231e-05', 'ppl': '3.048', 'memory/max_active (GiB)': '48.15', 'memory/max_allocated (GiB)': '48.15', 'memory/device_reserved (GiB)': '67.89', 'tokens/train_per_sec_per_gpu': '312.6', 'tokens/trainable': 165922, 'tokens/total': 365046, 'epoch': '0.04959'} | |
| 5% 14/263 [11:08<3:15:29, 47.11s/it] {'loss': '1.097', 'grad_norm': '0.5891', 'learning_rate': '0.0001', 'ppl': '2.995', 'memory/max_active (GiB)': '37.46', 'memory/max_allocated (GiB)': '37.46', 'memory/device_reserved (GiB)': '47.39', 'tokens/train_per_sec_per_gpu': '291.3', 'tokens/trainable': 180382, 'tokens/total': 398168, 'epoch': '0.05341'} | |
| 6% 15/263 [11:59<3:19:12, 48.19s/it] {'loss': '1.005', 'grad_norm': '0.5228', 'learning_rate': '0.0001077', 'ppl': '2.732', 'memory/max_active (GiB)': '39.68', 'memory/max_allocated (GiB)': '39.68', 'memory/device_reserved (GiB)': '55.06', 'tokens/train_per_sec_per_gpu': '245.6', 'tokens/trainable': 192840, 'tokens/total': 425746, 'epoch': '0.05722'} | |
| 6% 16/263 [12:52<3:24:35, 49.70s/it] {'loss': '0.871', 'grad_norm': '0.36', 'learning_rate': '0.0001154', 'ppl': '2.389', 'memory/max_active (GiB)': '50.71', 'memory/max_allocated (GiB)': '50.71', 'memory/device_reserved (GiB)': '71.7', 'tokens/train_per_sec_per_gpu': '322.6', 'tokens/trainable': 209997, 'tokens/total': 464790, 'epoch': '0.06104'} | |
| 6% 17/263 [13:45<3:27:34, 50.63s/it] {'loss': '0.98', 'grad_norm': '0.5478', 'learning_rate': '0.0001231', 'ppl': '2.664', 'memory/max_active (GiB)': '59.65', 'memory/max_allocated (GiB)': '59.65', 'memory/device_reserved (GiB)': '77.29', 'tokens/train_per_sec_per_gpu': '331.7', 'tokens/trainable': 227508, 'tokens/total': 502040, 'epoch': '0.06485'} | |
| 7% 18/263 [14:33<3:24:00, 49.96s/it] {'loss': '1.008', 'grad_norm': '0.3831', 'learning_rate': '0.0001308', 'ppl': '2.74', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '52.24', 'tokens/train_per_sec_per_gpu': '315.6', 'tokens/trainable': 242788, 'tokens/total': 533846, 'epoch': '0.06867'} | |
| 7% 19/263 [15:13<3:10:30, 46.85s/it] {'loss': '0.9226', 'grad_norm': '0.4333', 'learning_rate': '0.0001385', 'ppl': '2.516', 'memory/max_active (GiB)': '41.82', 'memory/max_allocated (GiB)': '41.82', 'memory/device_reserved (GiB)': '58.38', 'tokens/train_per_sec_per_gpu': '382.9', 'tokens/trainable': 257947, 'tokens/total': 566334, 'epoch': '0.07248'} | |
| 8% 20/263 [16:11<3:23:38, 50.28s/it] {'loss': '0.9231', 'grad_norm': '0.4565', 'learning_rate': '0.0001462', 'ppl': '2.517', 'memory/max_active (GiB)': '37.11', 'memory/max_allocated (GiB)': '37.11', 'memory/device_reserved (GiB)': '46.94', 'tokens/train_per_sec_per_gpu': '361.3', 'tokens/trainable': 279009, 'tokens/total': 610288, 'epoch': '0.0763'} | |
| 8% 21/263 [17:02<3:23:17, 50.40s/it] {'loss': '0.9047', 'grad_norm': '0.5018', 'learning_rate': '0.0001538', 'ppl': '2.471', 'memory/max_active (GiB)': '33.83', 'memory/max_allocated (GiB)': '33.83', 'memory/device_reserved (GiB)': '42.33', 'tokens/train_per_sec_per_gpu': '243', 'tokens/trainable': 291324, 'tokens/total': 637560, 'epoch': '0.08011'} | |
| 8% 22/263 [17:44<3:12:51, 48.01s/it] {'loss': '0.892', 'grad_norm': '0.4374', 'learning_rate': '0.0001615', 'ppl': '2.44', 'memory/max_active (GiB)': '31.34', 'memory/max_allocated (GiB)': '31.34', 'memory/device_reserved (GiB)': '41.63', 'tokens/train_per_sec_per_gpu': '440', 'tokens/trainable': 309998, 'tokens/total': 672066, 'epoch': '0.08393'} | |
| 9% 23/263 [18:16<2:52:03, 43.01s/it] {'loss': '1', 'grad_norm': '0.3918', 'learning_rate': '0.0001692', 'ppl': '2.719', 'memory/max_active (GiB)': '27.48', 'memory/max_allocated (GiB)': '27.48', 'memory/device_reserved (GiB)': '35.84', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 322055, 'tokens/total': 698356, 'epoch': '0.08774'} | |
| 9% 24/263 [18:42<2:31:15, 37.97s/it] {'loss': '0.9319', 'grad_norm': '0.3207', 'learning_rate': '0.0001769', 'ppl': '2.539', 'memory/max_active (GiB)': '22.66', 'memory/max_allocated (GiB)': '22.66', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '382.2', 'tokens/trainable': 332075, 'tokens/total': 718552, 'epoch': '0.09156'} | |
| 10% 25/263 [19:26<2:38:17, 39.90s/it] {'loss': '0.8685', 'grad_norm': '0.2337', 'learning_rate': '0.0001846', 'ppl': '2.383', 'memory/max_active (GiB)': '31.11', 'memory/max_allocated (GiB)': '31.11', 'memory/device_reserved (GiB)': '38.38', 'tokens/train_per_sec_per_gpu': '287', 'tokens/trainable': 344822, 'tokens/total': 746628, 'epoch': '0.09537'} | |
| 10% 26/263 [20:17<2:50:36, 43.19s/it] {'loss': '0.9079', 'grad_norm': '0.2105', 'learning_rate': '0.0001923', 'ppl': '2.479', 'memory/max_active (GiB)': '27.1', 'memory/max_allocated (GiB)': '27.1', 'memory/device_reserved (GiB)': '35.33', 'tokens/train_per_sec_per_gpu': '241.9', 'tokens/trainable': 357129, 'tokens/total': 775900, 'epoch': '0.09919'} | |
| 10% 27/263 [20:59<2:48:53, 42.94s/it] {'loss': '0.9097', 'grad_norm': '0.2115', 'learning_rate': '0.0002', 'ppl': '2.484', 'memory/max_active (GiB)': '45.8', 'memory/max_allocated (GiB)': '45.8', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '297.6', 'tokens/trainable': 369730, 'tokens/total': 802730, 'epoch': '0.103'} | |
| 11% 28/263 [21:41<2:46:11, 42.43s/it] {'loss': '0.8107', 'grad_norm': '0.1699', 'learning_rate': '0.0002', 'ppl': '2.249', 'memory/max_active (GiB)': '39.66', 'memory/max_allocated (GiB)': '39.66', 'memory/device_reserved (GiB)': '64.2', 'tokens/train_per_sec_per_gpu': '399.8', 'tokens/trainable': 386221, 'tokens/total': 839556, 'epoch': '0.1068'} | |
| 11% 29/263 [22:13<2:33:22, 39.33s/it] {'loss': '0.8618', 'grad_norm': '0.188', 'learning_rate': '0.0002', 'ppl': '2.367', 'memory/max_active (GiB)': '30.9', 'memory/max_allocated (GiB)': '30.9', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '358.7', 'tokens/trainable': 397730, 'tokens/total': 866216, 'epoch': '0.1106'} | |
| 11% 30/263 [22:51<2:31:11, 38.93s/it] {'loss': '0.7588', 'grad_norm': '0.1495', 'learning_rate': '0.0001999', 'ppl': '2.136', 'memory/max_active (GiB)': '31.81', 'memory/max_allocated (GiB)': '31.81', 'memory/device_reserved (GiB)': '42.5', 'tokens/train_per_sec_per_gpu': '420', 'tokens/trainable': 413695, 'tokens/total': 898918, 'epoch': '0.1144'} | |
| 12% 31/263 [23:27<2:27:18, 38.10s/it] {'loss': '0.7096', 'grad_norm': '0.1585', 'learning_rate': '0.0001999', 'ppl': '2.033', 'memory/max_active (GiB)': '26.02', 'memory/max_allocated (GiB)': '26.02', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '283.3', 'tokens/trainable': 423936, 'tokens/total': 923272, 'epoch': '0.1183'} | |
| 12% 32/263 [24:16<2:39:47, 41.50s/it] {'loss': '0.7129', 'grad_norm': '0.1396', 'learning_rate': '0.0001998', 'ppl': '2.04', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.17', 'tokens/train_per_sec_per_gpu': '434.5', 'tokens/trainable': 445423, 'tokens/total': 968072, 'epoch': '0.1221'} | |
| 13% 33/263 [24:52<2:32:16, 39.72s/it] {'loss': '0.8433', 'grad_norm': '0.1598', 'learning_rate': '0.0001997', 'ppl': '2.324', 'memory/max_active (GiB)': '28.78', 'memory/max_allocated (GiB)': '28.78', 'memory/device_reserved (GiB)': '61.66', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 458707, 'tokens/total': 996758, 'epoch': '0.1259'} | |
| 13% 34/263 [25:44<2:46:13, 43.55s/it] {'loss': '0.7034', 'grad_norm': '0.1374', 'learning_rate': '0.0001996', 'ppl': '2.021', 'memory/max_active (GiB)': '47.51', 'memory/max_allocated (GiB)': '47.51', 'memory/device_reserved (GiB)': '67.08', 'tokens/train_per_sec_per_gpu': '380.5', 'tokens/trainable': 478680, 'tokens/total': 1041764, 'epoch': '0.1297'} | |
| 13% 35/263 [26:23<2:39:55, 42.09s/it] {'loss': '0.705', 'grad_norm': '0.1833', 'learning_rate': '0.0001994', 'ppl': '2.024', 'memory/max_active (GiB)': '35.78', 'memory/max_allocated (GiB)': '35.78', 'memory/device_reserved (GiB)': '45.08', 'tokens/train_per_sec_per_gpu': '333.5', 'tokens/trainable': 491578, 'tokens/total': 1071866, 'epoch': '0.1335'} | |
| 14% 36/263 [27:25<3:01:14, 47.90s/it] {'loss': '0.7842', 'grad_norm': '0.129', 'learning_rate': '0.0001993', 'ppl': '2.191', 'memory/max_active (GiB)': '60.96', 'memory/max_allocated (GiB)': '60.96', 'memory/device_reserved (GiB)': '68.99', 'tokens/train_per_sec_per_gpu': '468.1', 'tokens/trainable': 520355, 'tokens/total': 1124412, 'epoch': '0.1373'} | |
| 14% 37/263 [27:58<2:43:52, 43.51s/it] {'loss': '0.81', 'grad_norm': '0.164', 'learning_rate': '0.0001991', 'ppl': '2.248', 'memory/max_active (GiB)': '40.05', 'memory/max_allocated (GiB)': '40.05', 'memory/device_reserved (GiB)': '55.65', 'tokens/train_per_sec_per_gpu': '384.6', 'tokens/trainable': 533138, 'tokens/total': 1152862, 'epoch': '0.1412'} | |
| 14% 38/263 [28:42<2:43:52, 43.70s/it] {'loss': '0.6781', 'grad_norm': '0.1564', 'learning_rate': '0.0001989', 'ppl': '1.97', 'memory/max_active (GiB)': '49.18', 'memory/max_allocated (GiB)': '49.18', 'memory/device_reserved (GiB)': '69.5', 'tokens/train_per_sec_per_gpu': '413.4', 'tokens/trainable': 551391, 'tokens/total': 1190852, 'epoch': '0.145'} | |
| 15% 39/263 [29:17<2:33:07, 41.01s/it] {'loss': '0.8666', 'grad_norm': '0.1817', 'learning_rate': '0.0001987', 'ppl': '2.379', 'memory/max_active (GiB)': '42.54', 'memory/max_allocated (GiB)': '42.54', 'memory/device_reserved (GiB)': '59.24', 'tokens/train_per_sec_per_gpu': '349.9', 'tokens/trainable': 563547, 'tokens/total': 1219812, 'epoch': '0.1488'} | |
| 15% 40/263 [29:53<2:27:38, 39.72s/it] {'loss': '0.7456', 'grad_norm': '0.1504', 'learning_rate': '0.0001985', 'ppl': '2.108', 'memory/max_active (GiB)': '37.56', 'memory/max_allocated (GiB)': '37.56', 'memory/device_reserved (GiB)': '47.6', 'tokens/train_per_sec_per_gpu': '429.8', 'tokens/trainable': 579326, 'tokens/total': 1251662, 'epoch': '0.1526'} | |
| 16% 41/263 [30:30<2:23:42, 38.84s/it] {'loss': '0.7436', 'grad_norm': '0.1403', 'learning_rate': '0.0001983', 'ppl': '2.103', 'memory/max_active (GiB)': '35.11', 'memory/max_allocated (GiB)': '35.11', 'memory/device_reserved (GiB)': '44.3', 'tokens/train_per_sec_per_gpu': '410.9', 'tokens/trainable': 594442, 'tokens/total': 1285326, 'epoch': '0.1564'} | |
| 16% 42/263 [31:12<2:26:40, 39.82s/it] {'loss': '0.752', 'grad_norm': '0.1654', 'learning_rate': '0.000198', 'ppl': '2.121', 'memory/max_active (GiB)': '31.65', 'memory/max_allocated (GiB)': '31.65', 'memory/device_reserved (GiB)': '41.45', 'tokens/train_per_sec_per_gpu': '317.2', 'tokens/trainable': 607800, 'tokens/total': 1315504, 'epoch': '0.1602'} | |
| 16% 43/263 [31:54<2:28:01, 40.37s/it] {'loss': '0.823', 'grad_norm': '0.1637', 'learning_rate': '0.0001978', 'ppl': '2.277', 'memory/max_active (GiB)': '26.15', 'memory/max_allocated (GiB)': '26.15', 'memory/device_reserved (GiB)': '28.68', 'tokens/train_per_sec_per_gpu': '277.6', 'tokens/trainable': 619363, 'tokens/total': 1338730, 'epoch': '0.164'} | |
| 17% 44/263 [32:34<2:26:37, 40.17s/it] {'loss': '0.7822', 'grad_norm': '0.1594', 'learning_rate': '0.0001975', 'ppl': '2.186', 'memory/max_active (GiB)': '36.38', 'memory/max_allocated (GiB)': '36.38', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '375.6', 'tokens/trainable': 634277, 'tokens/total': 1370806, 'epoch': '0.1679'} | |
| 17% 45/263 [33:11<2:23:02, 39.37s/it] {'loss': '0.7665', 'grad_norm': '0.1315', 'learning_rate': '0.0001972', 'ppl': '2.152', 'memory/max_active (GiB)': '32.6', 'memory/max_allocated (GiB)': '32.6', 'memory/device_reserved (GiB)': '46.06', 'tokens/train_per_sec_per_gpu': '406.5', 'tokens/trainable': 649519, 'tokens/total': 1403632, 'epoch': '0.1717'} | |
| 17% 46/263 [33:41<2:12:01, 36.50s/it] {'loss': '0.7592', 'grad_norm': '0.1396', 'learning_rate': '0.0001968', 'ppl': '2.137', 'memory/max_active (GiB)': '22.71', 'memory/max_allocated (GiB)': '22.71', 'memory/device_reserved (GiB)': '33.91', 'tokens/train_per_sec_per_gpu': '378.9', 'tokens/trainable': 660815, 'tokens/total': 1426008, 'epoch': '0.1755'} | |
| 18% 47/263 [34:20<2:14:37, 37.40s/it] {'loss': '0.7492', 'grad_norm': '0.1523', 'learning_rate': '0.0001965', 'ppl': '2.115', 'memory/max_active (GiB)': '30.88', 'memory/max_allocated (GiB)': '30.88', 'memory/device_reserved (GiB)': '40.8', 'tokens/train_per_sec_per_gpu': '300.6', 'tokens/trainable': 672685, 'tokens/total': 1452446, 'epoch': '0.1793'} | |
| 18% 48/263 [35:11<2:28:02, 41.32s/it] {'loss': '0.7758', 'grad_norm': '0.1287', 'learning_rate': '0.0001962', 'ppl': '2.172', 'memory/max_active (GiB)': '46.66', 'memory/max_allocated (GiB)': '46.66', 'memory/device_reserved (GiB)': '65.45', 'tokens/train_per_sec_per_gpu': '309.9', 'tokens/trainable': 688320, 'tokens/total': 1484130, 'epoch': '0.1831'} | |
| 19% 49/263 [35:50<2:24:35, 40.54s/it] {'loss': '0.723', 'grad_norm': '0.1277', 'learning_rate': '0.0001958', 'ppl': '2.061', 'memory/max_active (GiB)': '31.37', 'memory/max_allocated (GiB)': '31.37', 'memory/device_reserved (GiB)': '41.67', 'tokens/train_per_sec_per_gpu': '351.3', 'tokens/trainable': 701926, 'tokens/total': 1512564, 'epoch': '0.1869'} | |
| 19% 50/263 [36:34<2:27:53, 41.66s/it] {'loss': '0.6734', 'grad_norm': '0.1139', 'learning_rate': '0.0001954', 'ppl': '1.961', 'memory/max_active (GiB)': '38.79', 'memory/max_allocated (GiB)': '38.79', 'memory/device_reserved (GiB)': '53.5', 'tokens/train_per_sec_per_gpu': '458.3', 'tokens/trainable': 722216, 'tokens/total': 1551038, 'epoch': '0.1907'} | |
| 19% 51/263 [37:28<2:40:22, 45.39s/it] {'loss': '0.7031', 'grad_norm': '0.1693', 'learning_rate': '0.000195', 'ppl': '2.02', 'memory/max_active (GiB)': '45.1', 'memory/max_allocated (GiB)': '45.1', 'memory/device_reserved (GiB)': '63.62', 'tokens/train_per_sec_per_gpu': '270.4', 'tokens/trainable': 736843, 'tokens/total': 1585540, 'epoch': '0.1946'} | |
| 20% 52/263 [38:10<2:36:32, 44.52s/it] {'loss': '0.7569', 'grad_norm': '0.1728', 'learning_rate': '0.0001946', 'ppl': '2.132', 'memory/max_active (GiB)': '20.75', 'memory/max_allocated (GiB)': '20.75', 'memory/device_reserved (GiB)': '25.67', 'tokens/train_per_sec_per_gpu': '179.7', 'tokens/trainable': 744474, 'tokens/total': 1602602, 'epoch': '0.1984'} | |
| 20% 53/263 [38:53<2:33:22, 43.82s/it] {'loss': '0.81', 'grad_norm': '0.1633', 'learning_rate': '0.0001941', 'ppl': '2.248', 'memory/max_active (GiB)': '27.01', 'memory/max_allocated (GiB)': '27.01', 'memory/device_reserved (GiB)': '35.24', 'tokens/train_per_sec_per_gpu': '223.7', 'tokens/trainable': 753915, 'tokens/total': 1625426, 'epoch': '0.2022'} | |
| 21% 54/263 [39:42<2:37:58, 45.35s/it] {'loss': '0.7126', 'grad_norm': '0.1385', 'learning_rate': '0.0001937', 'ppl': '2.039', 'memory/max_active (GiB)': '37.95', 'memory/max_allocated (GiB)': '37.95', 'memory/device_reserved (GiB)': '48.03', 'tokens/train_per_sec_per_gpu': '320.1', 'tokens/trainable': 769572, 'tokens/total': 1660362, 'epoch': '0.206'} | |
| 21% 55/263 [40:31<2:41:57, 46.72s/it] {'loss': '0.6959', 'grad_norm': '0.1312', 'learning_rate': '0.0001932', 'ppl': '2.006', 'memory/max_active (GiB)': '45.85', 'memory/max_allocated (GiB)': '45.85', 'memory/device_reserved (GiB)': '64.3', 'tokens/train_per_sec_per_gpu': '344.5', 'tokens/trainable': 786763, 'tokens/total': 1697704, 'epoch': '0.2098'} | |
| 21% 56/263 [41:15<2:37:46, 45.73s/it] {'loss': '0.7052', 'grad_norm': '0.1675', 'learning_rate': '0.0001927', 'ppl': '2.024', 'memory/max_active (GiB)': '28.46', 'memory/max_allocated (GiB)': '28.46', 'memory/device_reserved (GiB)': '37.35', 'tokens/train_per_sec_per_gpu': '235.5', 'tokens/trainable': 796991, 'tokens/total': 1720524, 'epoch': '0.2136'} | |
| 22% 57/263 [42:15<2:51:52, 50.06s/it] {'loss': '0.6921', 'grad_norm': '0.1401', 'learning_rate': '0.0001922', 'ppl': '1.998', 'memory/max_active (GiB)': '54.96', 'memory/max_allocated (GiB)': '54.96', 'memory/device_reserved (GiB)': '78.17', 'tokens/train_per_sec_per_gpu': '296.9', 'tokens/trainable': 814853, 'tokens/total': 1756640, 'epoch': '0.2175'} | |
| 22% 58/263 [43:00<2:45:34, 48.46s/it] {'loss': '0.7537', 'grad_norm': '0.1416', 'learning_rate': '0.0001917', 'ppl': '2.125', 'memory/max_active (GiB)': '32.83', 'memory/max_allocated (GiB)': '32.83', 'memory/device_reserved (GiB)': '41.12', 'tokens/train_per_sec_per_gpu': '332.1', 'tokens/trainable': 829705, 'tokens/total': 1793570, 'epoch': '0.2213'} | |
| 22% 59/263 [43:42<2:38:03, 46.49s/it] {'loss': '0.731', 'grad_norm': '0.1516', 'learning_rate': '0.0001911', 'ppl': '2.077', 'memory/max_active (GiB)': '33.92', 'memory/max_allocated (GiB)': '33.92', 'memory/device_reserved (GiB)': '43.25', 'tokens/train_per_sec_per_gpu': '357.5', 'tokens/trainable': 844681, 'tokens/total': 1827586, 'epoch': '0.2251'} | |
| 23% 60/263 [44:31<2:40:35, 47.46s/it] {'loss': '0.7322', 'grad_norm': '0.1597', 'learning_rate': '0.0001906', 'ppl': '2.08', 'memory/max_active (GiB)': '27.46', 'memory/max_allocated (GiB)': '27.46', 'memory/device_reserved (GiB)': '42.7', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 855924, 'tokens/total': 1853648, 'epoch': '0.2289'} | |
| 23% 61/263 [45:19<2:39:37, 47.41s/it] {'loss': '0.7511', 'grad_norm': '0.1819', 'learning_rate': '0.00019', 'ppl': '2.119', 'memory/max_active (GiB)': '27.97', 'memory/max_allocated (GiB)': '27.97', 'memory/device_reserved (GiB)': '36.56', 'tokens/train_per_sec_per_gpu': '225.6', 'tokens/trainable': 866595, 'tokens/total': 1878090, 'epoch': '0.2327'} | |
| 24% 62/263 [46:01<2:33:28, 45.81s/it] {'loss': '0.6627', 'grad_norm': '0.1462', 'learning_rate': '0.0001894', 'ppl': '1.94', 'memory/max_active (GiB)': '38.92', 'memory/max_allocated (GiB)': '38.92', 'memory/device_reserved (GiB)': '53.8', 'tokens/train_per_sec_per_gpu': '373.5', 'tokens/trainable': 882309, 'tokens/total': 1913740, 'epoch': '0.2365'} | |
| 24% 63/263 [47:16<3:01:53, 54.57s/it] {'loss': '0.6414', 'grad_norm': '0.1468', 'learning_rate': '0.0001888', 'ppl': '1.899', 'memory/max_active (GiB)': '56.13', 'memory/max_allocated (GiB)': '56.13', 'memory/device_reserved (GiB)': '73.46', 'tokens/train_per_sec_per_gpu': '332.8', 'tokens/trainable': 907269, 'tokens/total': 1971906, 'epoch': '0.2403'} | |
| 24% 64/263 [47:56<2:46:47, 50.29s/it] {'loss': '0.7153', 'grad_norm': '0.2024', 'learning_rate': '0.0001882', 'ppl': '2.045', 'memory/max_active (GiB)': '53.82', 'memory/max_allocated (GiB)': '53.82', 'memory/device_reserved (GiB)': '76.49', 'tokens/train_per_sec_per_gpu': '322', 'tokens/trainable': 920250, 'tokens/total': 2002390, 'epoch': '0.2442'} | |
| 25% 65/263 [48:30<2:29:57, 45.44s/it] {'loss': '0.7389', 'grad_norm': '0.1698', 'learning_rate': '0.0001876', 'ppl': '2.094', 'memory/max_active (GiB)': '29.14', 'memory/max_allocated (GiB)': '29.14', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '324.5', 'tokens/trainable': 931325, 'tokens/total': 2028270, 'epoch': '0.248'} | |
| 25% 66/263 [49:17<2:30:15, 45.76s/it] {'loss': '0.8371', 'grad_norm': '0.2279', 'learning_rate': '0.0001869', 'ppl': '2.31', 'memory/max_active (GiB)': '33.75', 'memory/max_allocated (GiB)': '33.75', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '214.5', 'tokens/trainable': 941301, 'tokens/total': 2051108, 'epoch': '0.2518'} | |
| 25% 67/263 [50:05<2:31:52, 46.49s/it] {'loss': '0.705', 'grad_norm': '0.1503', 'learning_rate': '0.0001863', 'ppl': '2.024', 'memory/max_active (GiB)': '27.18', 'memory/max_allocated (GiB)': '27.18', 'memory/device_reserved (GiB)': '35.22', 'tokens/train_per_sec_per_gpu': '279', 'tokens/trainable': 954748, 'tokens/total': 2079644, 'epoch': '0.2556'} | |
| 26% 68/263 [50:43<2:23:13, 44.07s/it] {'loss': '0.7131', 'grad_norm': '0.1581', 'learning_rate': '0.0001856', 'ppl': '2.04', 'memory/max_active (GiB)': '34.76', 'memory/max_allocated (GiB)': '34.76', 'memory/device_reserved (GiB)': '43.81', 'tokens/train_per_sec_per_gpu': '356.9', 'tokens/trainable': 968461, 'tokens/total': 2106514, 'epoch': '0.2594'} | |
| 26% 69/263 [51:16<2:11:22, 40.63s/it] {'loss': '0.7511', 'grad_norm': '0.1736', 'learning_rate': '0.0001849', 'ppl': '2.119', 'memory/max_active (GiB)': '27.61', 'memory/max_allocated (GiB)': '27.61', 'memory/device_reserved (GiB)': '36.19', 'tokens/train_per_sec_per_gpu': '401.6', 'tokens/trainable': 981554, 'tokens/total': 2133242, 'epoch': '0.2632'} | |
| 27% 70/263 [51:47<2:00:59, 37.61s/it] {'loss': '0.7844', 'grad_norm': '0.1974', 'learning_rate': '0.0001842', 'ppl': '2.191', 'memory/max_active (GiB)': '22.87', 'memory/max_allocated (GiB)': '22.87', 'memory/device_reserved (GiB)': '35.35', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 990994, 'tokens/total': 2153354, 'epoch': '0.267'} | |
| 27% 71/263 [52:43<2:18:20, 43.23s/it] {'loss': '0.6828', 'grad_norm': '0.2258', 'learning_rate': '0.0001835', 'ppl': '1.98', 'memory/max_active (GiB)': '39.9', 'memory/max_allocated (GiB)': '39.9', 'memory/device_reserved (GiB)': '55.24', 'tokens/train_per_sec_per_gpu': '225.3', 'tokens/trainable': 1003686, 'tokens/total': 2182806, 'epoch': '0.2709'} | |
| 27% 72/263 [54:06<2:55:59, 55.29s/it] {'loss': '0.5772', 'grad_norm': '0.1597', 'learning_rate': '0.0001827', 'ppl': '1.781', 'memory/max_active (GiB)': '59.04', 'memory/max_allocated (GiB)': '59.04', 'memory/device_reserved (GiB)': '76.68', 'tokens/train_per_sec_per_gpu': '300.7', 'tokens/trainable': 1028766, 'tokens/total': 2241490, 'epoch': '0.2747'} | |
| 28% 73/263 [54:46<2:39:57, 50.51s/it] {'loss': '0.6909', 'grad_norm': '0.1693', 'learning_rate': '0.000182', 'ppl': '1.995', 'memory/max_active (GiB)': '42.9', 'memory/max_allocated (GiB)': '42.9', 'memory/device_reserved (GiB)': '59.81', 'tokens/train_per_sec_per_gpu': '349.6', 'tokens/trainable': 1042530, 'tokens/total': 2272086, 'epoch': '0.2785'} | |
| 28% 74/263 [55:30<2:33:13, 48.65s/it] {'loss': '0.7108', 'grad_norm': '0.1959', 'learning_rate': '0.0001812', 'ppl': '2.036', 'memory/max_active (GiB)': '28.32', 'memory/max_allocated (GiB)': '28.32', 'memory/device_reserved (GiB)': '37.09', 'tokens/train_per_sec_per_gpu': '224', 'tokens/trainable': 1052450, 'tokens/total': 2294892, 'epoch': '0.2823'} | |
| 29% 75/263 [56:22<2:35:57, 49.78s/it] {'loss': '0.7308', 'grad_norm': '0.1824', 'learning_rate': '0.0001804', 'ppl': '2.077', 'memory/max_active (GiB)': '25.12', 'memory/max_allocated (GiB)': '25.12', 'memory/device_reserved (GiB)': '32.62', 'tokens/train_per_sec_per_gpu': '252.4', 'tokens/trainable': 1065680, 'tokens/total': 2320714, 'epoch': '0.2861'} | |
| 29% 76/263 [57:06<2:29:33, 47.99s/it] {'loss': '0.6717', 'grad_norm': '0.1747', 'learning_rate': '0.0001796', 'ppl': '1.958', 'memory/max_active (GiB)': '23.01', 'memory/max_allocated (GiB)': '23.01', 'memory/device_reserved (GiB)': '29.15', 'tokens/train_per_sec_per_gpu': '217.3', 'tokens/trainable': 1075199, 'tokens/total': 2342076, 'epoch': '0.2899'} | |
| 29% 77/263 [57:38<2:13:45, 43.15s/it] {'loss': '0.7283', 'grad_norm': '0.1765', 'learning_rate': '0.0001788', 'ppl': '2.071', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '353.1', 'tokens/trainable': 1086448, 'tokens/total': 2367826, 'epoch': '0.2938'} | |
| 30% 78/263 [58:24<2:15:13, 43.86s/it] {'loss': '0.6897', 'grad_norm': '0.133', 'learning_rate': '0.000178', 'ppl': '1.993', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.97', 'tokens/train_per_sec_per_gpu': '393.6', 'tokens/trainable': 1104361, 'tokens/total': 2406764, 'epoch': '0.2976'} | |
| 30% 79/263 [59:16<2:22:50, 46.58s/it] {'loss': '0.726', 'grad_norm': '0.1792', 'learning_rate': '0.0001772', 'ppl': '2.067', 'memory/max_active (GiB)': '40.91', 'memory/max_allocated (GiB)': '40.91', 'memory/device_reserved (GiB)': '56.76', 'tokens/train_per_sec_per_gpu': '245.3', 'tokens/trainable': 1117342, 'tokens/total': 2434766, 'epoch': '0.3014'} | |
| 30% 80/263 [1:00:12<2:30:27, 49.33s/it] {'loss': '0.7226', 'grad_norm': '0.1773', 'learning_rate': '0.0001763', 'ppl': '2.06', 'memory/max_active (GiB)': '24.17', 'memory/max_allocated (GiB)': '24.17', 'memory/device_reserved (GiB)': '30.76', 'tokens/train_per_sec_per_gpu': '201.2', 'tokens/trainable': 1128559, 'tokens/total': 2459278, 'epoch': '0.3052'} | |
| 31% 81/263 [1:00:56<2:24:41, 47.70s/it] {'loss': '0.687', 'grad_norm': '0.2009', 'learning_rate': '0.0001755', 'ppl': '1.988', 'memory/max_active (GiB)': '25.85', 'memory/max_allocated (GiB)': '25.85', 'memory/device_reserved (GiB)': '33.68', 'tokens/train_per_sec_per_gpu': '252', 'tokens/trainable': 1139622, 'tokens/total': 2482288, 'epoch': '0.309'} | |
| 31% 82/263 [1:01:30<2:11:37, 43.64s/it] {'loss': '0.6404', 'grad_norm': '0.1794', 'learning_rate': '0.0001746', 'ppl': '1.897', 'memory/max_active (GiB)': '34.83', 'memory/max_allocated (GiB)': '34.83', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '332.6', 'tokens/trainable': 1150978, 'tokens/total': 2508102, 'epoch': '0.3128'} | |
| 32% 83/263 [1:02:17<2:13:57, 44.65s/it] {'loss': '0.6434', 'grad_norm': '0.1725', 'learning_rate': '0.0001737', 'ppl': '1.903', 'memory/max_active (GiB)': '34.96', 'memory/max_allocated (GiB)': '34.96', 'memory/device_reserved (GiB)': '43.95', 'tokens/train_per_sec_per_gpu': '264.2', 'tokens/trainable': 1163404, 'tokens/total': 2534552, 'epoch': '0.3166'} | |
| 32% 84/263 [1:03:06<2:17:12, 45.99s/it] {'loss': '0.6432', 'grad_norm': '0.144', 'learning_rate': '0.0001728', 'ppl': '1.903', 'memory/max_active (GiB)': '24.85', 'memory/max_allocated (GiB)': '24.85', 'memory/device_reserved (GiB)': '32.01', 'tokens/train_per_sec_per_gpu': '300', 'tokens/trainable': 1178135, 'tokens/total': 2562982, 'epoch': '0.3205'} | |
| 32% 85/263 [1:03:53<2:16:37, 46.05s/it] {'loss': '0.7349', 'grad_norm': '0.2154', 'learning_rate': '0.0001719', 'ppl': '2.085', 'memory/max_active (GiB)': '25.97', 'memory/max_allocated (GiB)': '25.97', 'memory/device_reserved (GiB)': '33.5', 'tokens/train_per_sec_per_gpu': '201.4', 'tokens/trainable': 1187438, 'tokens/total': 2585130, 'epoch': '0.3243'} | |
| 33% 86/263 [1:04:34<2:11:44, 44.66s/it] {'loss': '0.667', 'grad_norm': '0.1647', 'learning_rate': '0.0001709', 'ppl': '1.948', 'memory/max_active (GiB)': '32.35', 'memory/max_allocated (GiB)': '32.35', 'memory/device_reserved (GiB)': '35.79', 'tokens/train_per_sec_per_gpu': '308.1', 'tokens/trainable': 1200197, 'tokens/total': 2610804, 'epoch': '0.3281'} | |
| 33% 87/263 [1:05:11<2:03:50, 42.22s/it] {'loss': '0.7045', 'grad_norm': '0.1813', 'learning_rate': '0.00017', 'ppl': '2.023', 'memory/max_active (GiB)': '34.17', 'memory/max_allocated (GiB)': '34.17', 'memory/device_reserved (GiB)': '42.85', 'tokens/train_per_sec_per_gpu': '308.7', 'tokens/trainable': 1211474, 'tokens/total': 2635758, 'epoch': '0.3319'} | |
| 33% 88/263 [1:06:03<2:12:03, 45.28s/it] {'loss': '0.7118', 'grad_norm': '0.1373', 'learning_rate': '0.0001691', 'ppl': '2.038', 'memory/max_active (GiB)': '49.54', 'memory/max_allocated (GiB)': '49.54', 'memory/device_reserved (GiB)': '69.87', 'tokens/train_per_sec_per_gpu': '390', 'tokens/trainable': 1231915, 'tokens/total': 2679352, 'epoch': '0.3357'} | |
| 34% 89/263 [1:06:36<2:00:21, 41.50s/it] {'loss': '0.7212', 'grad_norm': '0.1959', 'learning_rate': '0.0001681', 'ppl': '2.057', 'memory/max_active (GiB)': '26.55', 'memory/max_allocated (GiB)': '26.55', 'memory/device_reserved (GiB)': '38.62', 'tokens/train_per_sec_per_gpu': '346.5', 'tokens/trainable': 1243243, 'tokens/total': 2703340, 'epoch': '0.3395'} | |
| 34% 90/263 [1:07:06<1:50:05, 38.18s/it] {'loss': '0.7258', 'grad_norm': '0.2195', 'learning_rate': '0.0001671', 'ppl': '2.066', 'memory/max_active (GiB)': '30.13', 'memory/max_allocated (GiB)': '30.13', 'memory/device_reserved (GiB)': '39.89', 'tokens/train_per_sec_per_gpu': '316.4', 'tokens/trainable': 1252869, 'tokens/total': 2726904, 'epoch': '0.3433'} | |
| 35% 91/263 [1:07:50<1:54:23, 39.90s/it] {'loss': '0.7356', 'grad_norm': '0.1998', 'learning_rate': '0.0001661', 'ppl': '2.087', 'memory/max_active (GiB)': '36.96', 'memory/max_allocated (GiB)': '36.96', 'memory/device_reserved (GiB)': '46.66', 'tokens/train_per_sec_per_gpu': '350.6', 'tokens/trainable': 1268267, 'tokens/total': 2760058, 'epoch': '0.3472'} | |
| 35% 92/263 [1:08:35<1:57:53, 41.37s/it] {'loss': '0.6534', 'grad_norm': '0.1845', 'learning_rate': '0.0001651', 'ppl': '1.922', 'memory/max_active (GiB)': '35.55', 'memory/max_allocated (GiB)': '35.55', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '339.5', 'tokens/trainable': 1283469, 'tokens/total': 2794120, 'epoch': '0.351'} | |
| 35% 93/263 [1:09:07<1:49:00, 38.47s/it] {'loss': '0.7409', 'grad_norm': '0.1926', 'learning_rate': '0.0001641', 'ppl': '2.098', 'memory/max_active (GiB)': '23.77', 'memory/max_allocated (GiB)': '23.77', 'memory/device_reserved (GiB)': '44.75', 'tokens/train_per_sec_per_gpu': '334.8', 'tokens/trainable': 1294090, 'tokens/total': 2818588, 'epoch': '0.3548'} | |
| 36% 94/263 [1:09:49<1:51:37, 39.63s/it] {'loss': '0.6379', 'grad_norm': '0.1578', 'learning_rate': '0.0001631', 'ppl': '1.892', 'memory/max_active (GiB)': '34.71', 'memory/max_allocated (GiB)': '34.71', 'memory/device_reserved (GiB)': '43.64', 'tokens/train_per_sec_per_gpu': '369.7', 'tokens/trainable': 1309742, 'tokens/total': 2851686, 'epoch': '0.3586'} | |
| 36% 95/263 [1:10:26<1:49:15, 39.02s/it] {'loss': '0.7076', 'grad_norm': '0.1786', 'learning_rate': '0.0001621', 'ppl': '2.029', 'memory/max_active (GiB)': '33.53', 'memory/max_allocated (GiB)': '33.53', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '353.4', 'tokens/trainable': 1323027, 'tokens/total': 2880764, 'epoch': '0.3624'} | |
| 37% 96/263 [1:10:59<1:42:48, 36.94s/it] {'loss': '0.6912', 'grad_norm': '0.1893', 'learning_rate': '0.000161', 'ppl': '1.996', 'memory/max_active (GiB)': '35.45', 'memory/max_allocated (GiB)': '35.45', 'memory/device_reserved (GiB)': '44.59', 'tokens/train_per_sec_per_gpu': '352.2', 'tokens/trainable': 1334326, 'tokens/total': 2905950, 'epoch': '0.3662'} | |
| 37%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 96/263 [1:10:59<1:42:48, 36.94s/it] 37%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 97/263 [1:11:43<1:48:20, 39.16s/it] {'loss': '0.6708', 'grad_norm': '0.173', 'learning_rate': '0.00016', 'ppl': '1.956', 'memory/max_active (GiB)': '47.2', 'memory/max_allocated (GiB)': '47.2', 'memory/device_reserved (GiB)': '66.49', 'tokens/train_per_sec_per_gpu': '351.1', 'tokens/trainable': 1349894, 'tokens/total': 2942380, 'epoch': '0.3701'} | |
| 37%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 97/263 [1:11:43<1:48:20, 39.16s/it] 37%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 98/263 [1:12:38<2:00:32, 43.83s/it] {'loss': '0.685', 'grad_norm': '0.1576', 'learning_rate': '0.0001589', 'ppl': '1.984', 'memory/max_active (GiB)': '37.19', 'memory/max_allocated (GiB)': '37.19', 'memory/device_reserved (GiB)': '47.07', 'tokens/train_per_sec_per_gpu': '357.9', 'tokens/trainable': 1369481, 'tokens/total': 2988904, 'epoch': '0.3739'} | |
| 37%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 98/263 [1:12:38<2:00:32, 43.83s/it] 38%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 99/263 [1:13:21<1:59:39, 43.77s/it] {'loss': '0.6715', 'grad_norm': '0.1494', 'learning_rate': '0.0001578', 'ppl': '1.957', 'memory/max_active (GiB)': '42.43', 'memory/max_allocated (GiB)': '42.43', 'memory/device_reserved (GiB)': '59.1', 'tokens/train_per_sec_per_gpu': '384.8', 'tokens/trainable': 1386274, 'tokens/total': 3026814, 'epoch': '0.3777'} | |
| 38%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 99/263 [1:13:21<1:59:39, 43.77s/it] 38%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 100/263 [1:13:55<1:51:09, 40.92s/it] {'loss': '0.6481', 'grad_norm': '0.208', 'learning_rate': '0.0001567', 'ppl': '1.912', 'memory/max_active (GiB)': '26.51', 'memory/max_allocated (GiB)': '26.51', 'memory/device_reserved (GiB)': '34.38', 'tokens/train_per_sec_per_gpu': '296.5', 'tokens/trainable': 1396430, 'tokens/total': 3050820, 'epoch': '0.3815'} | |
| 38%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 100/263 [1:13:55<1:51:09, 40.92s/it] 38%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 101/263 [1:14:53<2:04:04, 45.96s/it] {'loss': '0.7205', 'grad_norm': '0.1546', 'learning_rate': '0.0001556', 'ppl': '2.056', 'memory/max_active (GiB)': '44.95', 'memory/max_allocated (GiB)': '44.95', 'memory/device_reserved (GiB)': '62.88', 'tokens/train_per_sec_per_gpu': '350', 'tokens/trainable': 1416631, 'tokens/total': 3099000, 'epoch': '0.3853'} | |
| 38%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 101/263 [1:14:53<2:04:04, 45.96s/it] 39%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 102/263 [1:15:38<2:02:20, 45.60s/it] {'loss': '0.7284', 'grad_norm': '0.1686', 'learning_rate': '0.0001545', 'ppl': '2.072', 'memory/max_active (GiB)': '32', 'memory/max_allocated (GiB)': '32', 'memory/device_reserved (GiB)': '42.76', 'tokens/train_per_sec_per_gpu': '367.9', 'tokens/trainable': 1433096, 'tokens/total': 3134960, 'epoch': '0.3891'} | |
| 39%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 102/263 [1:15:38<2:02:20, 45.60s/it] 39%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 103/263 [1:16:06<1:47:33, 40.33s/it] {'loss': '0.6715', 'grad_norm': '0.1936', 'learning_rate': '0.0001534', 'ppl': '1.957', 'memory/max_active (GiB)': '26.34', 'memory/max_allocated (GiB)': '26.34', 'memory/device_reserved (GiB)': '34.07', 'tokens/train_per_sec_per_gpu': '351.9', 'tokens/trainable': 1442967, 'tokens/total': 3156628, 'epoch': '0.3929'} | |
| 39%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 103/263 [1:16:06<1:47:33, 40.33s/it] 40%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 104/263 [1:16:41<1:42:26, 38.66s/it] {'loss': '0.5962', 'grad_norm': '0.1888', 'learning_rate': '0.0001523', 'ppl': '1.815', 'memory/max_active (GiB)': '28.66', 'memory/max_allocated (GiB)': '28.66', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 1453178, 'tokens/total': 3180850, 'epoch': '0.3968'} | |
| 40%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 104/263 [1:16:41<1:42:26, 38.66s/it] 40%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 105/263 [1:17:11<1:35:30, 36.27s/it] {'loss': '0.6637', 'grad_norm': '0.1802', 'learning_rate': '0.0001511', 'ppl': '1.942', 'memory/max_active (GiB)': '28.04', 'memory/max_allocated (GiB)': '28.04', 'memory/device_reserved (GiB)': '37.52', 'tokens/train_per_sec_per_gpu': '375', 'tokens/trainable': 1464686, 'tokens/total': 3203888, 'epoch': '0.4006'} | |
| 40%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 105/263 [1:17:11<1:35:30, 36.27s/it] 40%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 106/263 [1:17:42<1:30:16, 34.50s/it] {'loss': '0.7321', 'grad_norm': '0.2234', 'learning_rate': '0.00015', 'ppl': '2.079', 'memory/max_active (GiB)': '21.56', 'memory/max_allocated (GiB)': '21.56', 'memory/device_reserved (GiB)': '28.04', 'tokens/train_per_sec_per_gpu': '313.8', 'tokens/trainable': 1474221, 'tokens/total': 3224060, 'epoch': '0.4044'} | |
| 40%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 106/263 [1:17:42<1:30:16, 34.50s/it] 41%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 107/263 [1:18:34<1:43:12, 39.70s/it] {'loss': '0.5681', 'grad_norm': '0.162', 'learning_rate': '0.0001488', 'ppl': '1.765', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '53.23', 'tokens/train_per_sec_per_gpu': '297.5', 'tokens/trainable': 1489638, 'tokens/total': 3260576, 'epoch': '0.4082'} | |
| 41%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 107/263 [1:18:34<1:43:12, 39.70s/it] 41%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 108/263 [1:19:20<1:47:51, 41.75s/it] {'loss': '0.708', 'grad_norm': '0.197', 'learning_rate': '0.0001477', 'ppl': '2.03', 'memory/max_active (GiB)': '29.18', 'memory/max_allocated (GiB)': '29.18', 'memory/device_reserved (GiB)': '38.48', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 1502425, 'tokens/total': 3287854, 'epoch': '0.412'} | |
| 41%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 108/263 [1:19:20<1:47:51, 41.75s/it] 41%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 109/263 [1:20:18<1:59:25, 46.53s/it] {'loss': '0.6689', 'grad_norm': '0.1784', 'learning_rate': '0.0001465', 'ppl': '1.952', 'memory/max_active (GiB)': '38.9', 'memory/max_allocated (GiB)': '38.9', 'memory/device_reserved (GiB)': '53.77', 'tokens/train_per_sec_per_gpu': '320.6', 'tokens/trainable': 1520916, 'tokens/total': 3326198, 'epoch': '0.4158'} | |
| 41%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 109/263 [1:20:18<1:59:25, 46.53s/it] 42%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 110/263 [1:21:01<1:56:00, 45.49s/it] {'loss': '0.6355', 'grad_norm': '0.1931', 'learning_rate': '0.0001453', 'ppl': '1.888', 'memory/max_active (GiB)': '28.51', 'memory/max_allocated (GiB)': '28.51', 'memory/device_reserved (GiB)': '43.21', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 1532644, 'tokens/total': 3353708, 'epoch': '0.4196'} | |
| 42%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 110/263 [1:21:01<1:56:00, 45.49s/it] 42%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 111/263 [1:21:37<1:48:21, 42.77s/it] {'loss': '0.6837', 'grad_norm': '0.1755', 'learning_rate': '0.0001442', 'ppl': '1.981', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.12', 'tokens/train_per_sec_per_gpu': '395.7', 'tokens/trainable': 1547058, 'tokens/total': 3383196, 'epoch': '0.4235'} | |
| 42%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 111/263 [1:21:37<1:48:21, 42.77s/it] 43%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 112/263 [1:22:31<1:55:30, 45.90s/it] {'loss': '0.6215', 'grad_norm': '0.2103', 'learning_rate': '0.000143', 'ppl': '1.862', 'memory/max_active (GiB)': '40.6', 'memory/max_allocated (GiB)': '40.6', 'memory/device_reserved (GiB)': '56.43', 'tokens/train_per_sec_per_gpu': '248.4', 'tokens/trainable': 1560272, 'tokens/total': 3415688, 'epoch': '0.4273'} | |
| 43%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 112/263 [1:22:31<1:55:30, 45.90s/it] 43%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 113/263 [1:23:20<1:57:39, 47.06s/it] {'loss': '0.6699', 'grad_norm': '0.1813', 'learning_rate': '0.0001418', 'ppl': '1.954', 'memory/max_active (GiB)': '37.04', 'memory/max_allocated (GiB)': '37.04', 'memory/device_reserved (GiB)': '46.8', 'tokens/train_per_sec_per_gpu': '282.5', 'tokens/trainable': 1574335, 'tokens/total': 3444166, 'epoch': '0.4311'} | |
| 43%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 113/263 [1:23:20<1:57:39, 47.06s/it] 43%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 114/263 [1:24:12<2:00:02, 48.34s/it] {'loss': '0.594', 'grad_norm': '0.1998', 'learning_rate': '0.0001406', 'ppl': '1.811', 'memory/max_active (GiB)': '36.6', 'memory/max_allocated (GiB)': '36.6', 'memory/device_reserved (GiB)': '46.27', 'tokens/train_per_sec_per_gpu': '276.6', 'tokens/trainable': 1588531, 'tokens/total': 3472530, 'epoch': '0.4349'} | |
| 43%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 114/263 [1:24:12<2:00:02, 48.34s/it] 44%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 115/263 [1:25:04<2:01:51, 49.40s/it] {'loss': '0.7205', 'grad_norm': '0.1913', 'learning_rate': '0.0001393', 'ppl': '2.055', 'memory/max_active (GiB)': '37.17', 'memory/max_allocated (GiB)': '37.17', 'memory/device_reserved (GiB)': '47.05', 'tokens/train_per_sec_per_gpu': '279.3', 'tokens/trainable': 1603020, 'tokens/total': 3503644, 'epoch': '0.4387'} | |
| 44%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 115/263 [1:25:04<2:01:51, 49.40s/it] 44%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 116/263 [1:25:48<1:57:32, 47.97s/it] {'loss': '0.7426', 'grad_norm': '0.2782', 'learning_rate': '0.0001381', 'ppl': '2.101', 'memory/max_active (GiB)': '40', 'memory/max_allocated (GiB)': '40', 'memory/device_reserved (GiB)': '55.41', 'tokens/train_per_sec_per_gpu': '231.5', 'tokens/trainable': 1613354, 'tokens/total': 3529882, 'epoch': '0.4425'} | |
| 44%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 116/263 [1:25:48<1:57:32, 47.97s/it] 44%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 117/263 [1:26:52<2:08:08, 52.66s/it] {'loss': '0.6484', 'grad_norm': '0.1659', 'learning_rate': '0.0001369', 'ppl': '1.913', 'memory/max_active (GiB)': '54.45', 'memory/max_allocated (GiB)': '54.45', 'memory/device_reserved (GiB)': '77.33', 'tokens/train_per_sec_per_gpu': '272', 'tokens/trainable': 1630658, 'tokens/total': 3566156, 'epoch': '0.4464'} | |
| 44%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 117/263 [1:26:52<2:08:08, 52.66s/it] 45%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 118/263 [1:27:52<2:12:36, 54.87s/it] {'loss': '0.611', 'grad_norm': '0.1486', 'learning_rate': '0.0001357', 'ppl': '1.842', 'memory/max_active (GiB)': '43.75', 'memory/max_allocated (GiB)': '43.75', 'memory/device_reserved (GiB)': '61.14', 'tokens/train_per_sec_per_gpu': '329.4', 'tokens/trainable': 1650425, 'tokens/total': 3606182, 'epoch': '0.4502'} | |
| 45%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 118/263 [1:27:52<2:12:36, 54.87s/it] 45%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 119/263 [1:28:50<2:13:58, 55.82s/it] {'loss': '0.6299', 'grad_norm': '0.1702', 'learning_rate': '0.0001344', 'ppl': '1.877', 'memory/max_active (GiB)': '60.94', 'memory/max_allocated (GiB)': '60.94', 'memory/device_reserved (GiB)': '69.08', 'tokens/train_per_sec_per_gpu': '289.6', 'tokens/trainable': 1667235, 'tokens/total': 3645356, 'epoch': '0.454'} | |
| 45%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 119/263 [1:28:50<2:13:58, 55.82s/it] 46%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 120/263 [1:29:35<2:05:41, 52.74s/it] {'loss': '0.7218', 'grad_norm': '0.1997', 'learning_rate': '0.0001332', 'ppl': '2.058', 'memory/max_active (GiB)': '37.26', 'memory/max_allocated (GiB)': '37.26', 'memory/device_reserved (GiB)': '47', 'tokens/train_per_sec_per_gpu': '310.7', 'tokens/trainable': 1681386, 'tokens/total': 3674268, 'epoch': '0.4578'} | |
| 46%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 120/263 [1:29:35<2:05:41, 52.74s/it] 46%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 121/263 [1:30:33<2:08:14, 54.19s/it] {'loss': '0.6603', 'grad_norm': '0.2087', 'learning_rate': '0.0001319', 'ppl': '1.935', 'memory/max_active (GiB)': '42.12', 'memory/max_allocated (GiB)': '42.12', 'memory/device_reserved (GiB)': '58.63', 'tokens/train_per_sec_per_gpu': '268.9', 'tokens/trainable': 1696869, 'tokens/total': 3710714, 'epoch': '0.4616'} | |
| 46%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 121/263 [1:30:33<2:08:14, 54.19s/it] 46%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 122/263 [1:31:17<2:00:10, 51.14s/it] {'loss': '0.7366', 'grad_norm': '0.2266', 'learning_rate': '0.0001306', 'ppl': '2.089', 'memory/max_active (GiB)': '23.04', 'memory/max_allocated (GiB)': '23.04', 'memory/device_reserved (GiB)': '45.24', 'tokens/train_per_sec_per_gpu': '206.1', 'tokens/trainable': 1705941, 'tokens/total': 3730260, 'epoch': '0.4654'} | |
| 46%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 122/263 [1:31:17<2:00:10, 51.14s/it] 47%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 123/263 [1:32:05<1:57:21, 50.29s/it] {'loss': '0.7187', 'grad_norm': '0.164', 'learning_rate': '0.0001294', 'ppl': '2.052', 'memory/max_active (GiB)': '31.57', 'memory/max_allocated (GiB)': '31.57', 'memory/device_reserved (GiB)': '41.88', 'tokens/train_per_sec_per_gpu': '350.2', 'tokens/trainable': 1722869, 'tokens/total': 3764600, 'epoch': '0.4692'} | |
| 47%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 123/263 [1:32:05<1:57:21, 50.29s/it] 47%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 124/263 [1:32:54<1:55:10, 49.72s/it] {'loss': '0.7666', 'grad_norm': '0.2024', 'learning_rate': '0.0001281', 'ppl': '2.152', 'memory/max_active (GiB)': '28.74', 'memory/max_allocated (GiB)': '28.74', 'memory/device_reserved (GiB)': '37.72', 'tokens/train_per_sec_per_gpu': '259.1', 'tokens/trainable': 1735401, 'tokens/total': 3789866, 'epoch': '0.4731'} | |
| 47%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 124/263 [1:32:54<1:55:10, 49.72s/it] 48%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 125/263 [1:33:56<2:03:06, 53.52s/it] {'loss': '0.6747', 'grad_norm': '0.1734', 'learning_rate': '0.0001268', 'ppl': '1.963', 'memory/max_active (GiB)': '36.35', 'memory/max_allocated (GiB)': '36.35', 'memory/device_reserved (GiB)': '45.94', 'tokens/train_per_sec_per_gpu': '266.3', 'tokens/trainable': 1752018, 'tokens/total': 3826538, 'epoch': '0.4769'} | |
| 48%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 125/263 [1:33:56<2:03:06, 53.52s/it] 48%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 126/263 [1:34:58<2:07:43, 55.94s/it] {'loss': '0.7026', 'grad_norm': '0.1709', 'learning_rate': '0.0001256', 'ppl': '2.019', 'memory/max_active (GiB)': '34.18', 'memory/max_allocated (GiB)': '34.18', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '257.9', 'tokens/trainable': 1767897, 'tokens/total': 3861520, 'epoch': '0.4807'} | |
| 48%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 126/263 [1:34:58<2:07:43, 55.94s/it] 48%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 127/263 [1:35:43<1:59:53, 52.89s/it] {'loss': '0.7321', 'grad_norm': '0.2378', 'learning_rate': '0.0001243', 'ppl': '2.079', 'memory/max_active (GiB)': '26.64', 'memory/max_allocated (GiB)': '26.64', 'memory/device_reserved (GiB)': '42.92', 'tokens/train_per_sec_per_gpu': '211', 'tokens/trainable': 1777560, 'tokens/total': 3883630, 'epoch': '0.4845'} | |
| 48%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 127/263 [1:35:43<1:59:53, 52.89s/it] 49%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 128/263 [1:36:21<1:48:48, 48.36s/it] {'loss': '0.6494', 'grad_norm': '0.1878', 'learning_rate': '0.000123', 'ppl': '1.914', 'memory/max_active (GiB)': '23.99', 'memory/max_allocated (GiB)': '23.99', 'memory/device_reserved (GiB)': '31.15', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 1789026, 'tokens/total': 3907826, 'epoch': '0.4883'} | |
| 49%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 128/263 [1:36:21<1:48:48, 48.36s/it] 49%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 129/263 [1:37:18<1:53:46, 50.95s/it] {'loss': '0.6286', 'grad_norm': '0.1865', 'learning_rate': '0.0001217', 'ppl': '1.875', 'memory/max_active (GiB)': '34.03', 'memory/max_allocated (GiB)': '34.03', 'memory/device_reserved (GiB)': '42.72', 'tokens/train_per_sec_per_gpu': '281.1', 'tokens/trainable': 1805039, 'tokens/total': 3942120, 'epoch': '0.4921'} | |
| 49%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 129/263 [1:37:18<1:53:46, 50.95s/it] 49%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 130/263 [1:38:19<1:59:46, 54.03s/it] {'loss': '0.5905', 'grad_norm': '0.1831', 'learning_rate': '0.0001204', 'ppl': '1.805', 'memory/max_active (GiB)': '40.32', 'memory/max_allocated (GiB)': '40.32', 'memory/device_reserved (GiB)': '55.9', 'tokens/train_per_sec_per_gpu': '282.4', 'tokens/trainable': 1822330, 'tokens/total': 3976238, 'epoch': '0.4959'} | |
| 49%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 130/263 [1:38:19<1:59:46, 54.03s/it] 50%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 131/263 [1:39:03<1:52:00, 50.91s/it] {'loss': '0.6443', 'grad_norm': '0.1812', 'learning_rate': '0.0001191', 'ppl': '1.905', 'memory/max_active (GiB)': '33.09', 'memory/max_allocated (GiB)': '33.09', 'memory/device_reserved (GiB)': '41.43', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 1835502, 'tokens/total': 4004024, 'epoch': '0.4998'} | |
| 50%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 131/263 [1:39:03<1:52:00, 50.91s/it] 50%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 132/263 [1:40:07<1:59:46, 54.86s/it] {'loss': '0.6177', 'grad_norm': '0.158', 'learning_rate': '0.0001178', 'ppl': '1.855', 'memory/max_active (GiB)': '44.4', 'memory/max_allocated (GiB)': '44.4', 'memory/device_reserved (GiB)': '62.21', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 1854705, 'tokens/total': 4047366, 'epoch': '0.5036'} | |
| 50%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 132/263 [1:40:07<1:59:46, 54.86s/it] 51%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 133/263 [1:41:01<1:58:27, 54.67s/it] {'loss': '0.6597', 'grad_norm': '0.197', 'learning_rate': '0.0001165', 'ppl': '1.934', 'memory/max_active (GiB)': '38.42', 'memory/max_allocated (GiB)': '38.42', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '225.4', 'tokens/trainable': 1866931, 'tokens/total': 4073028, 'epoch': '0.5074'} | |
| 51%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 133/263 [1:41:01<1:58:27, 54.67s/it] 51%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 134/263 [1:42:00<2:00:25, 56.01s/it] {'loss': '0.6264', 'grad_norm': '0.1803', 'learning_rate': '0.0001152', 'ppl': '1.871', 'memory/max_active (GiB)': '35.15', 'memory/max_allocated (GiB)': '35.15', 'memory/device_reserved (GiB)': '48.52', 'tokens/train_per_sec_per_gpu': '217.6', 'tokens/trainable': 1879795, 'tokens/total': 4103326, 'epoch': '0.5112'} | |
| 51%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 134/263 [1:42:01<2:00:25, 56.01s/it] 51%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 135/263 [1:43:08<2:06:45, 59.42s/it] {'loss': '0.5894', 'grad_norm': '0.1796', 'learning_rate': '0.0001139', 'ppl': '1.803', 'memory/max_active (GiB)': '60.2', 'memory/max_allocated (GiB)': '60.2', 'memory/device_reserved (GiB)': '78.05', 'tokens/train_per_sec_per_gpu': '280.6', 'tokens/trainable': 1898706, 'tokens/total': 4145556, 'epoch': '0.515'} | |
| 51%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 135/263 [1:43:08<2:06:45, 59.42s/it] 52%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 136/263 [1:44:02<2:02:32, 57.90s/it] {'loss': '0.6652', 'grad_norm': '0.1649', 'learning_rate': '0.0001126', 'ppl': '1.945', 'memory/max_active (GiB)': '36.24', 'memory/max_allocated (GiB)': '36.24', 'memory/device_reserved (GiB)': '45.69', 'tokens/train_per_sec_per_gpu': '253.7', 'tokens/trainable': 1912491, 'tokens/total': 4178912, 'epoch': '0.5188'} | |
| 52%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 136/263 [1:44:02<2:02:32, 57.90s/it] 52%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 137/263 [1:45:08<2:06:33, 60.27s/it] {'loss': '0.6049', 'grad_norm': '0.1469', 'learning_rate': '0.0001112', 'ppl': '1.831', 'memory/max_active (GiB)': '54.11', 'memory/max_allocated (GiB)': '54.11', 'memory/device_reserved (GiB)': '77.07', 'tokens/train_per_sec_per_gpu': '291.8', 'tokens/trainable': 1931691, 'tokens/total': 4219728, 'epoch': '0.5227'} | |
| 52%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 137/263 [1:45:08<2:06:33, 60.27s/it] 52%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 138/263 [1:45:55<1:57:30, 56.40s/it] {'loss': '0.7114', 'grad_norm': '0.219', 'learning_rate': '0.0001099', 'ppl': '2.037', 'memory/max_active (GiB)': '29.78', 'memory/max_allocated (GiB)': '29.78', 'memory/device_reserved (GiB)': '39.28', 'tokens/train_per_sec_per_gpu': '230.8', 'tokens/trainable': 1942627, 'tokens/total': 4246086, 'epoch': '0.5265'} | |
| 52%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 138/263 [1:45:55<1:57:30, 56.40s/it] 53%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 139/263 [1:46:33<1:44:54, 50.76s/it] {'loss': '0.6658', 'grad_norm': '0.2107', 'learning_rate': '0.0001086', 'ppl': '1.946', 'memory/max_active (GiB)': '32.15', 'memory/max_allocated (GiB)': '32.15', 'memory/device_reserved (GiB)': '35.63', 'tokens/train_per_sec_per_gpu': '371.6', 'tokens/trainable': 1956598, 'tokens/total': 4274466, 'epoch': '0.5303'} | |
| 53%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 139/263 [1:46:33<1:44:54, 50.76s/it] 53%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 140/263 [1:47:10<1:35:26, 46.56s/it] {'loss': '0.6586', 'grad_norm': '0.1947', 'learning_rate': '0.0001073', 'ppl': '1.932', 'memory/max_active (GiB)': '29.82', 'memory/max_allocated (GiB)': '29.82', 'memory/device_reserved (GiB)': '39.44', 'tokens/train_per_sec_per_gpu': '341.7', 'tokens/trainable': 1969155, 'tokens/total': 4302324, 'epoch': '0.5341'} | |
| 53%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 140/263 [1:47:10<1:35:26, 46.56s/it] 54%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 141/263 [1:47:52<1:32:13, 45.36s/it] {'loss': '0.7005', 'grad_norm': '0.2028', 'learning_rate': '0.000106', 'ppl': '2.015', 'memory/max_active (GiB)': '30.7', 'memory/max_allocated (GiB)': '30.7', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '272.9', 'tokens/trainable': 1980769, 'tokens/total': 4330164, 'epoch': '0.5379'} | |
| 54%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 141/263 [1:47:52<1:32:13, 45.36s/it] 54%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 142/263 [1:48:40<1:33:08, 46.19s/it] {'loss': '0.6936', 'grad_norm': '0.207', 'learning_rate': '0.0001046', 'ppl': '2.001', 'memory/max_active (GiB)': '30.38', 'memory/max_allocated (GiB)': '30.38', 'memory/device_reserved (GiB)': '40.1', 'tokens/train_per_sec_per_gpu': '213.5', 'tokens/trainable': 1991044, 'tokens/total': 4352668, 'epoch': '0.5417'} | |
| 54%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 142/263 [1:48:40<1:33:08, 46.19s/it] 54%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 143/263 [1:49:28<1:32:57, 46.48s/it] {'loss': '0.7092', 'grad_norm': '0.1816', 'learning_rate': '0.0001033', 'ppl': '2.032', 'memory/max_active (GiB)': '33.08', 'memory/max_allocated (GiB)': '33.08', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '325.4', 'tokens/trainable': 2006393, 'tokens/total': 4386038, 'epoch': '0.5455'} | |
| 54%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 143/263 [1:49:28<1:32:57, 46.48s/it] 55%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 144/263 [1:49:54<1:20:19, 40.50s/it] {'loss': '0.6535', 'grad_norm': '0.2587', 'learning_rate': '0.000102', 'ppl': '1.922', 'memory/max_active (GiB)': '24', 'memory/max_allocated (GiB)': '24', 'memory/device_reserved (GiB)': '30.59', 'tokens/train_per_sec_per_gpu': '326.8', 'tokens/trainable': 2015072, 'tokens/total': 4404946, 'epoch': '0.5494'} | |
| 55%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 144/263 [1:49:54<1:20:19, 40.50s/it] 55%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 145/263 [1:50:25<1:13:46, 37.51s/it] {'loss': '0.6609', 'grad_norm': '0.2187', 'learning_rate': '0.0001007', 'ppl': '1.937', 'memory/max_active (GiB)': '24.8', 'memory/max_allocated (GiB)': '24.8', 'memory/device_reserved (GiB)': '32.23', 'tokens/train_per_sec_per_gpu': '344.2', 'tokens/trainable': 2025583, 'tokens/total': 4426472, 'epoch': '0.5532'} | |
| 55%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 145/263 [1:50:25<1:13:46, 37.51s/it] 56%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 146/263 [1:51:10<1:17:32, 39.76s/it] {'loss': '0.6478', 'grad_norm': '0.2216', 'learning_rate': '9.934e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '28.07', 'memory/max_allocated (GiB)': '28.07', 'memory/device_reserved (GiB)': '36.78', 'tokens/train_per_sec_per_gpu': '212.7', 'tokens/trainable': 2035157, 'tokens/total': 4447532, 'epoch': '0.557'} | |
| 56%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 146/263 [1:51:10<1:17:32, 39.76s/it] 56%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 147/263 [1:51:59<1:22:08, 42.49s/it] {'loss': '0.6784', 'grad_norm': '0.2246', 'learning_rate': '9.801e-05', 'ppl': '1.971', 'memory/max_active (GiB)': '24.88', 'memory/max_allocated (GiB)': '24.88', 'memory/device_reserved (GiB)': '31.92', 'tokens/train_per_sec_per_gpu': '225.8', 'tokens/trainable': 2046191, 'tokens/total': 4469110, 'epoch': '0.5608'} | |
| 56% 148/263 [1:52:52<1:27:37, 45.72s/it] {'loss': '0.6373', 'grad_norm': '0.1724', 'learning_rate': '9.669e-05', 'ppl': '1.891', 'memory/max_active (GiB)': '44.04', 'memory/max_allocated (GiB)': '44.04', 'memory/device_reserved (GiB)': '61.51', 'tokens/train_per_sec_per_gpu': '327.3', 'tokens/trainable': 2063617, 'tokens/total': 4505868, 'epoch': '0.5646'} | |
| 57% 149/263 [1:53:39<1:27:45, 46.19s/it] {'loss': '0.5898', 'grad_norm': '0.1733', 'learning_rate': '9.536e-05', 'ppl': '1.804', 'memory/max_active (GiB)': '31.09', 'memory/max_allocated (GiB)': '31.09', 'memory/device_reserved (GiB)': '37.29', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 2078116, 'tokens/total': 4537032, 'epoch': '0.5684'} | |
| 57% 150/263 [1:54:25<1:26:36, 45.98s/it] {'loss': '0.5857', 'grad_norm': '0.199', 'learning_rate': '9.404e-05', 'ppl': '1.796', 'memory/max_active (GiB)': '36.3', 'memory/max_allocated (GiB)': '36.3', 'memory/device_reserved (GiB)': '45.92', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 2089719, 'tokens/total': 4564268, 'epoch': '0.5722'} | |
| 57% 151/263 [1:55:16<1:29:04, 47.72s/it] {'loss': '0.6078', 'grad_norm': '0.2164', 'learning_rate': '9.272e-05', 'ppl': '1.836', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '37.95', 'tokens/train_per_sec_per_gpu': '259.3', 'tokens/trainable': 2103145, 'tokens/total': 4593966, 'epoch': '0.5761'} | |
| 58% 152/263 [1:56:03<1:27:31, 47.31s/it] {'loss': '0.7859', 'grad_norm': '0.2286', 'learning_rate': '9.139e-05', 'ppl': '2.194', 'memory/max_active (GiB)': '25.58', 'memory/max_allocated (GiB)': '25.58', 'memory/device_reserved (GiB)': '33.29', 'tokens/train_per_sec_per_gpu': '226', 'tokens/trainable': 2113624, 'tokens/total': 4616602, 'epoch': '0.5799'} | |
| 58% 153/263 [1:56:39<1:20:33, 43.94s/it] {'loss': '0.6501', 'grad_norm': '0.2131', 'learning_rate': '9.007e-05', 'ppl': '1.916', 'memory/max_active (GiB)': '23.47', 'memory/max_allocated (GiB)': '23.47', 'memory/device_reserved (GiB)': '29.78', 'tokens/train_per_sec_per_gpu': '255.8', 'tokens/trainable': 2122849, 'tokens/total': 4636788, 'epoch': '0.5837'} | |
| 59% 154/263 [1:57:30<1:23:40, 46.06s/it] {'loss': '0.7322', 'grad_norm': '0.2115', 'learning_rate': '8.876e-05', 'ppl': '2.08', 'memory/max_active (GiB)': '38.63', 'memory/max_allocated (GiB)': '38.63', 'memory/device_reserved (GiB)': '53.38', 'tokens/train_per_sec_per_gpu': '324', 'tokens/trainable': 2139377, 'tokens/total': 4672022, 'epoch': '0.5875'} | |
| 59% 155/263 [1:58:23<1:26:31, 48.07s/it] {'loss': '0.6393', 'grad_norm': '0.1979', 'learning_rate': '8.744e-05', 'ppl': '1.895', 'memory/max_active (GiB)': '39.73', 'memory/max_allocated (GiB)': '39.73', 'memory/device_reserved (GiB)': '54.95', 'tokens/train_per_sec_per_gpu': '278.7', 'tokens/trainable': 2154077, 'tokens/total': 4701432, 'epoch': '0.5913'} | |
| 59% 156/263 [1:59:20<1:30:31, 50.76s/it] {'loss': '0.6361', 'grad_norm': '0.1782', 'learning_rate': '8.613e-05', 'ppl': '1.889', 'memory/max_active (GiB)': '42.74', 'memory/max_allocated (GiB)': '42.74', 'memory/device_reserved (GiB)': '59.9', 'tokens/train_per_sec_per_gpu': '296.8', 'tokens/trainable': 2171008, 'tokens/total': 4736142, 'epoch': '0.5951'} | |
| 60% 157/263 [2:00:09<1:28:43, 50.22s/it] {'loss': '0.7478', 'grad_norm': '0.185', 'learning_rate': '8.481e-05', 'ppl': '2.112', 'memory/max_active (GiB)': '39.56', 'memory/max_allocated (GiB)': '39.56', 'memory/device_reserved (GiB)': '54.81', 'tokens/train_per_sec_per_gpu': '291.1', 'tokens/trainable': 2185260, 'tokens/total': 4769674, 'epoch': '0.599'} | |
| 60% 158/263 [2:01:16<1:36:55, 55.38s/it] {'loss': '0.6923', 'grad_norm': '0.1927', 'learning_rate': '8.351e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '51.52', 'memory/max_allocated (GiB)': '51.52', 'memory/device_reserved (GiB)': '73.17', 'tokens/train_per_sec_per_gpu': '299.4', 'tokens/trainable': 2205449, 'tokens/total': 4813544, 'epoch': '0.6028'} | |
| 60% 159/263 [2:02:19<1:39:59, 57.69s/it] {'loss': '0.6566', 'grad_norm': '0.1942', 'learning_rate': '8.22e-05', 'ppl': '1.928', 'memory/max_active (GiB)': '45.07', 'memory/max_allocated (GiB)': '45.07', 'memory/device_reserved (GiB)': '63.03', 'tokens/train_per_sec_per_gpu': '265.1', 'tokens/trainable': 2222171, 'tokens/total': 4848392, 'epoch': '0.6066'} | |
| 61% 160/263 [2:03:11<1:36:19, 56.11s/it] {'loss': '0.6622', 'grad_norm': '0.1977', 'learning_rate': '8.09e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '46.42', 'memory/max_allocated (GiB)': '46.42', 'memory/device_reserved (GiB)': '65.06', 'tokens/train_per_sec_per_gpu': '337.5', 'tokens/trainable': 2239867, 'tokens/total': 4886222, 'epoch': '0.6104'} | |
| 61% 161/263 [2:03:47<1:24:41, 49.82s/it] {'loss': '0.6882', 'grad_norm': '0.1678', 'learning_rate': '7.96e-05', 'ppl': '1.99', 'memory/max_active (GiB)': '25.15', 'memory/max_allocated (GiB)': '25.15', 'memory/device_reserved (GiB)': '32.47', 'tokens/train_per_sec_per_gpu': '420.2', 'tokens/trainable': 2254631, 'tokens/total': 4914572, 'epoch': '0.6142'} | |
| 62% 162/263 [2:04:31<1:21:04, 48.17s/it] {'loss': '0.6634', 'grad_norm': '0.254', 'learning_rate': '7.83e-05', 'ppl': '1.941', 'memory/max_active (GiB)': '25.88', 'memory/max_allocated (GiB)': '25.88', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '195.9', 'tokens/trainable': 2263311, 'tokens/total': 4934728, 'epoch': '0.618'} | |
| 62% 163/263 [2:05:28<1:24:55, 50.96s/it] {'loss': '0.6232', 'grad_norm': '0.1764', 'learning_rate': '7.701e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '33.21', 'memory/max_allocated (GiB)': '33.21', 'memory/device_reserved (GiB)': '41.49', 'tokens/train_per_sec_per_gpu': '270', 'tokens/trainable': 2278829, 'tokens/total': 4967958, 'epoch': '0.6218'} | |
| 62% 164/263 [2:06:01<1:14:46, 45.32s/it] {'loss': '0.7008', 'grad_norm': '0.1909', 'learning_rate': '7.572e-05', 'ppl': '2.015', 'memory/max_active (GiB)': '27.44', 'memory/max_allocated (GiB)': '27.44', 'memory/device_reserved (GiB)': '35.74', 'tokens/train_per_sec_per_gpu': '401.3', 'tokens/trainable': 2291731, 'tokens/total': 4994350, 'epoch': '0.6257'} | |
| 63% 165/263 [2:06:33<1:07:45, 41.49s/it] {'loss': '0.6839', 'grad_norm': '0.213', 'learning_rate': '7.444e-05', 'ppl': '1.982', 'memory/max_active (GiB)': '31.91', 'memory/max_allocated (GiB)': '31.91', 'memory/device_reserved (GiB)': '42.43', 'tokens/train_per_sec_per_gpu': '404.2', 'tokens/trainable': 2304893, 'tokens/total': 5018514, 'epoch': '0.6295'} | |
| 63% 166/263 [2:07:29<1:13:54, 45.72s/it] {'loss': '0.6054', 'grad_norm': '0.1739', 'learning_rate': '7.316e-05', 'ppl': '1.832', 'memory/max_active (GiB)': '35.17', 'memory/max_allocated (GiB)': '35.17', 'memory/device_reserved (GiB)': '44.2', 'tokens/train_per_sec_per_gpu': '285.8', 'tokens/trainable': 2320783, 'tokens/total': 5050822, 'epoch': '0.6333'} | |
| 63% 167/263 [2:08:27<1:19:11, 49.50s/it] {'loss': '0.6887', 'grad_norm': '0.1949', 'learning_rate': '7.188e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '40.38', 'memory/max_allocated (GiB)': '40.38', 'memory/device_reserved (GiB)': '56.06', 'tokens/train_per_sec_per_gpu': '304.9', 'tokens/trainable': 2338559, 'tokens/total': 5082648, 'epoch': '0.6371'} | |
| 64% 168/263 [2:09:09<1:14:44, 47.20s/it] {'loss': '0.6576', 'grad_norm': '0.246', 'learning_rate': '7.061e-05', 'ppl': '1.93', 'memory/max_active (GiB)': '22.77', 'memory/max_allocated (GiB)': '22.77', 'memory/device_reserved (GiB)': '33.36', 'tokens/train_per_sec_per_gpu': '217.1', 'tokens/trainable': 2347641, 'tokens/total': 5103024, 'epoch': '0.6409'} | |
| 64% 169/263 [2:09:58<1:14:40, 47.66s/it] {'loss': '0.5939', 'grad_norm': '0.1881', 'learning_rate': '6.935e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '49.05', 'memory/max_allocated (GiB)': '49.05', 'memory/device_reserved (GiB)': '69.16', 'tokens/train_per_sec_per_gpu': '299.7', 'tokens/trainable': 2362246, 'tokens/total': 5135012, 'epoch': '0.6447'} | |
| 65% 170/263 [2:10:42<1:12:29, 46.77s/it] {'loss': '0.6675', 'grad_norm': '0.1941', 'learning_rate': '6.809e-05', 'ppl': '1.949', 'memory/max_active (GiB)': '40.07', 'memory/max_allocated (GiB)': '40.07', 'memory/device_reserved (GiB)': '55.49', 'tokens/train_per_sec_per_gpu': '330.2', 'tokens/trainable': 2376998, 'tokens/total': 5165914, 'epoch': '0.6485'} | |
| 65% 171/263 [2:11:22<1:08:18, 44.54s/it] {'loss': '0.662', 'grad_norm': '0.2005', 'learning_rate': '6.684e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '26.99', 'memory/max_allocated (GiB)': '26.99', 'memory/device_reserved (GiB)': '35.18', 'tokens/train_per_sec_per_gpu': '334.3', 'tokens/trainable': 2390156, 'tokens/total': 5192072, 'epoch': '0.6524'} | |
| 65% 172/263 [2:12:13<1:10:49, 46.70s/it] {'loss': '0.7026', 'grad_norm': '0.2259', 'learning_rate': '6.559e-05', 'ppl': '2.019', 'memory/max_active (GiB)': '47.05', 'memory/max_allocated (GiB)': '47.05', 'memory/device_reserved (GiB)': '66.2', 'tokens/train_per_sec_per_gpu': '264.1', 'tokens/trainable': 2403814, 'tokens/total': 5223520, 'epoch': '0.6562'} | |
| 66% 173/263 [2:13:04<1:11:49, 47.89s/it] {'loss': '0.6707', 'grad_norm': '0.197', 'learning_rate': '6.435e-05', 'ppl': '1.956', 'memory/max_active (GiB)': '23.56', 'memory/max_allocated (GiB)': '23.56', 'memory/device_reserved (GiB)': '30.05', 'tokens/train_per_sec_per_gpu': '217.8', 'tokens/trainable': 2414847, 'tokens/total': 5247654, 'epoch': '0.66'} | |
| 66% 174/263 [2:13:57<1:13:05, 49.28s/it] {'loss': '0.5625', 'grad_norm': '0.1798', 'learning_rate': '6.311e-05', 'ppl': '1.755', 'memory/max_active (GiB)': '30.71', 'memory/max_allocated (GiB)': '30.71', 'memory/device_reserved (GiB)': '40.59', 'tokens/train_per_sec_per_gpu': '295.5', 'tokens/trainable': 2430368, 'tokens/total': 5276410, 'epoch': '0.6638'} | |
| 67% 175/263 [2:14:44<1:11:22, 48.66s/it] {'loss': '0.6375', 'grad_norm': '0.1961', 'learning_rate': '6.188e-05', 'ppl': '1.892', 'memory/max_active (GiB)': '28.14', 'memory/max_allocated (GiB)': '28.14', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '234.5', 'tokens/trainable': 2441441, 'tokens/total': 5302790, 'epoch': '0.6676'} | |
| 67% 176/263 [2:15:37<1:12:44, 50.17s/it] {'loss': '0.6539', 'grad_norm': '0.1857', 'learning_rate': '6.066e-05', 'ppl': '1.923', 'memory/max_active (GiB)': '44.34', 'memory/max_allocated (GiB)': '44.34', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '260.5', 'tokens/trainable': 2455423, 'tokens/total': 5333674, 'epoch': '0.6714'} | |
| 67% 177/263 [2:16:24<1:10:15, 49.01s/it] {'loss': '0.5762', 'grad_norm': '0.1929', 'learning_rate': '5.945e-05', 'ppl': '1.779', 'memory/max_active (GiB)': '28.62', 'memory/max_allocated (GiB)': '28.62', 'memory/device_reserved (GiB)': '61.98', 'tokens/train_per_sec_per_gpu': '251.4', 'tokens/trainable': 2467068, 'tokens/total': 5359564, 'epoch': '0.6753'} | |
| 68% 178/263 [2:17:02<1:05:00, 45.89s/it] {'loss': '0.6303', 'grad_norm': '0.1936', 'learning_rate': '5.824e-05', 'ppl': '1.878', 'memory/max_active (GiB)': '33.46', 'memory/max_allocated (GiB)': '33.46', 'memory/device_reserved (GiB)': '42.02', 'tokens/train_per_sec_per_gpu': '384.4', 'tokens/trainable': 2481907, 'tokens/total': 5388678, 'epoch': '0.6791'} | |
| 68% 179/263 [2:17:37<59:27, 42.47s/it] {'loss': '0.6012', 'grad_norm': '0.203', 'learning_rate': '5.704e-05', 'ppl': '1.824', 'memory/max_active (GiB)': '25.76', 'memory/max_allocated (GiB)': '25.76', 'memory/device_reserved (GiB)': '33.21', 'tokens/train_per_sec_per_gpu': '338.2', 'tokens/trainable': 2493569, 'tokens/total': 5414394, 'epoch': '0.6829'} | |
| 68% 180/263 [2:18:10<54:55, 39.70s/it] {'loss': '0.6847', 'grad_norm': '0.1948', 'learning_rate': '5.585e-05', 'ppl': '1.983', 'memory/max_active (GiB)': '30.09', 'memory/max_allocated (GiB)': '30.09', 'memory/device_reserved (GiB)': '39.71', 'tokens/train_per_sec_per_gpu': '404.4', 'tokens/trainable': 2507016, 'tokens/total': 5440266, 'epoch': '0.6867'} | |
| 69% 181/263 [2:18:44<51:47, 37.90s/it] {'loss': '0.6923', 'grad_norm': '0.2085', 'learning_rate': '5.466e-05', 'ppl': '1.998', 'memory/max_active (GiB)': '35.58', 'memory/max_allocated (GiB)': '35.58', 'memory/device_reserved (GiB)': '44.79', 'tokens/train_per_sec_per_gpu': '367.1', 'tokens/trainable': 2519381, 'tokens/total': 5467466, 'epoch': '0.6905'} | |
| 69% 182/263 [2:19:19<50:05, 37.10s/it] {'loss': '0.7096', 'grad_norm': '0.196', 'learning_rate': '5.348e-05', 'ppl': '2.033', 'memory/max_active (GiB)': '30.89', 'memory/max_allocated (GiB)': '30.89', 'memory/device_reserved (GiB)': '40.73', 'tokens/train_per_sec_per_gpu': '379.8', 'tokens/trainable': 2532772, 'tokens/total': 5497208, 'epoch': '0.6943'} | |
| 70% 183/263 [2:19:53<48:13, 36.17s/it] {'loss': '0.676', 'grad_norm': '0.2051', 'learning_rate': '5.231e-05', 'ppl': '1.966', 'memory/max_active (GiB)': '24.11', 'memory/max_allocated (GiB)': '24.11', 'memory/device_reserved (GiB)': '38.66', 'tokens/train_per_sec_per_gpu': '345.2', 'tokens/trainable': 2544502, 'tokens/total': 5521580, 'epoch': '0.6981'} | |
| 70% 184/263 [2:20:43<52:59, 40.25s/it] {'loss': '0.6542', 'grad_norm': '0.2071', 'learning_rate': '5.115e-05', 'ppl': '1.924', 'memory/max_active (GiB)': '39.48', 'memory/max_allocated (GiB)': '39.48', 'memory/device_reserved (GiB)': '54.69', 'tokens/train_per_sec_per_gpu': '310.1', 'tokens/trainable': 2559938, 'tokens/total': 5553304, 'epoch': '0.702'} | |
| 70% 185/263 [2:21:23<52:24, 40.31s/it] {'loss': '0.6775', 'grad_norm': '0.2571', 'learning_rate': '5e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '38.61', 'memory/max_allocated (GiB)': '38.61', 'memory/device_reserved (GiB)': '53.19', 'tokens/train_per_sec_per_gpu': '308.9', 'tokens/trainable': 2572432, 'tokens/total': 5582308, 'epoch': '0.7058'} | |
| 71% 186/263 [2:22:13<55:23, 43.16s/it] {'loss': '0.6412', 'grad_norm': '0.2141', 'learning_rate': '4.886e-05', 'ppl': '1.899', 'memory/max_active (GiB)': '50.63', 'memory/max_allocated (GiB)': '50.63', 'memory/device_reserved (GiB)': '71.62', 'tokens/train_per_sec_per_gpu': '345.7', 'tokens/trainable': 2589650, 'tokens/total': 5622242, 'epoch': '0.7096'} | |
| 71% 187/263 [2:22:59<55:44, 44.01s/it] {'loss': '0.6154', 'grad_norm': '0.1996', 'learning_rate': '4.772e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '33.16', 'memory/max_allocated (GiB)': '33.16', 'memory/device_reserved (GiB)': '70.77', 'tokens/train_per_sec_per_gpu': '289.1', 'tokens/trainable': 2602951, 'tokens/total': 5652818, 'epoch': '0.7134'} | |
| 71% 188/263 [2:23:52<58:33, 46.84s/it] {'loss': '0.6231', 'grad_norm': '0.2165', 'learning_rate': '4.66e-05', 'ppl': '1.865', 'memory/max_active (GiB)': '55.1', 'memory/max_allocated (GiB)': '55.1', 'memory/device_reserved (GiB)': '78.44', 'tokens/train_per_sec_per_gpu': '241.5', 'tokens/trainable': 2615857, 'tokens/total': 5682190, 'epoch': '0.7172'} | |
| 72% 189/263 [2:24:51<1:02:07, 50.37s/it] {'loss': '0.644', 'grad_norm': '0.1827', 'learning_rate': '4.548e-05', 'ppl': '1.904', 'memory/max_active (GiB)': '40.42', 'memory/max_allocated (GiB)': '40.42', 'memory/device_reserved (GiB)': '56.19', 'tokens/train_per_sec_per_gpu': '308.8', 'tokens/trainable': 2633955, 'tokens/total': 5718572, 'epoch': '0.721'} | |
| 72% 190/263 [2:25:38<59:58, 49.29s/it] {'loss': '0.6435', 'grad_norm': '0.2112', 'learning_rate': '4.437e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '39.61', 'memory/max_allocated (GiB)': '39.61', 'memory/device_reserved (GiB)': '54.75', 'tokens/train_per_sec_per_gpu': '314.9', 'tokens/trainable': 2648683, 'tokens/total': 5751500, 'epoch': '0.7248'} | |
| 73% 191/263 [2:26:29<59:50, 49.87s/it] {'loss': '0.6434', 'grad_norm': '0.1974', 'learning_rate': '4.328e-05', 'ppl': '1.903', 'memory/max_active (GiB)': '48.38', 'memory/max_allocated (GiB)': '48.38', 'memory/device_reserved (GiB)': '68.25', 'tokens/train_per_sec_per_gpu': '370.3', 'tokens/trainable': 2667651, 'tokens/total': 5795452, 'epoch': '0.7287'} | |
| 73% 192/263 [2:27:33<1:04:07, 54.19s/it] {'loss': '0.7199', 'grad_norm': '0.1943', 'learning_rate': '4.219e-05', 'ppl': '2.054', 'memory/max_active (GiB)': '51.55', 'memory/max_allocated (GiB)': '51.55', 'memory/device_reserved (GiB)': '72.91', 'tokens/train_per_sec_per_gpu': '241.4', 'tokens/trainable': 2683164, 'tokens/total': 5831970, 'epoch': '0.7325'} | |
| 73% 193/263 [2:28:31<1:04:25, 55.22s/it] {'loss': '0.6151', 'grad_norm': '0.212', 'learning_rate': '4.111e-05', 'ppl': '1.85', 'memory/max_active (GiB)': '45.42', 'memory/max_allocated (GiB)': '45.42', 'memory/device_reserved (GiB)': '63.73', 'tokens/train_per_sec_per_gpu': '237.2', 'tokens/trainable': 2696830, 'tokens/total': 5864046, 'epoch': '0.7363'} | |
| 74% 194/263 [2:29:15<59:32, 51.78s/it] {'loss': '0.662', 'grad_norm': '0.1948', 'learning_rate': '4.005e-05', 'ppl': '1.939', 'memory/max_active (GiB)': '36.93', 'memory/max_allocated (GiB)': '36.93', 'memory/device_reserved (GiB)': '46.86', 'tokens/train_per_sec_per_gpu': '370.9', 'tokens/trainable': 2713056, 'tokens/total': 5897816, 'epoch': '0.7401'} | |
| 74% 195/263 [2:29:58<55:48, 49.24s/it] {'loss': '0.7405', 'grad_norm': '0.2793', 'learning_rate': '3.899e-05', 'ppl': '2.097', 'memory/max_active (GiB)': '25.49', 'memory/max_allocated (GiB)': '25.49', 'memory/device_reserved (GiB)': '32.93', 'tokens/train_per_sec_per_gpu': '231.7', 'tokens/trainable': 2723094, 'tokens/total': 5917064, 'epoch': '0.7439'} | |
| 75% 196/263 [2:30:43<53:30, 47.92s/it] {'loss': '0.6287', 'grad_norm': '0.2298', 'learning_rate': '3.795e-05', 'ppl': '1.875', 'memory/max_active (GiB)': '21.3', 'memory/max_allocated (GiB)': '21.3', 'memory/device_reserved (GiB)': '26.55', 'tokens/train_per_sec_per_gpu': '184.1', 'tokens/trainable': 2731350, 'tokens/total': 5936682, 'epoch': '0.7477'} | |
| 75% 197/263 [2:31:37<54:37, 49.66s/it] {'loss': '0.6935', 'grad_norm': '0.2097', 'learning_rate': '3.691e-05', 'ppl': '2.001', 'memory/max_active (GiB)': '30.31', 'memory/max_allocated (GiB)': '30.31', 'memory/device_reserved (GiB)': '40.12', 'tokens/train_per_sec_per_gpu': '254.8', 'tokens/trainable': 2745037, 'tokens/total': 5968342, 'epoch': '0.7515'} | |
| 75% 198/263 [2:32:29<54:33, 50.36s/it] {'loss': '0.6516', 'grad_norm': '0.2135', 'learning_rate': '3.589e-05', 'ppl': '1.919', 'memory/max_active (GiB)': '35.1', 'memory/max_allocated (GiB)': '35.1', 'memory/device_reserved (GiB)': '44.11', 'tokens/train_per_sec_per_gpu': '247.8', 'tokens/trainable': 2757923, 'tokens/total': 5997292, 'epoch': '0.7554'} | |
| 76% 199/263 [2:33:29<56:57, 53.40s/it] {'loss': '0.6918', 'grad_norm': '0.1828', 'learning_rate': '3.488e-05', 'ppl': '1.997', 'memory/max_active (GiB)': '56.91', 'memory/max_allocated (GiB)': '56.91', 'memory/device_reserved (GiB)': '73.99', 'tokens/train_per_sec_per_gpu': '302.4', 'tokens/trainable': 2776213, 'tokens/total': 6038550, 'epoch': '0.7592'} | |
| 76% 200/263 [2:34:22<56:02, 53.37s/it] {'loss': '0.6774', 'grad_norm': '0.2068', 'learning_rate': '3.388e-05', 'ppl': '1.969', 'memory/max_active (GiB)': '43.54', 'memory/max_allocated (GiB)': '43.54', 'memory/device_reserved (GiB)': '60.76', 'tokens/train_per_sec_per_gpu': '271.4', 'tokens/trainable': 2790680, 'tokens/total': 6069138, 'epoch': '0.763'} | |
| 76% 201/263 [2:35:22<57:08, 55.30s/it] {'loss': '0.6853', 'grad_norm': '0.2284', 'learning_rate': '3.289e-05', 'ppl': '1.984', 'memory/max_active (GiB)': '55.21', 'memory/max_allocated (GiB)': '55.21', 'memory/device_reserved (GiB)': '78.48', 'tokens/train_per_sec_per_gpu': '256.8', 'tokens/trainable': 2806039, 'tokens/total': 6104430, 'epoch': '0.7668'} | |
| 77% 202/263 [2:35:58<50:17, 49.47s/it] {'loss': '0.7109', 'grad_norm': '0.2175', 'learning_rate': '3.191e-05', 'ppl': '2.036', 'memory/max_active (GiB)': '23.78', 'memory/max_allocated (GiB)': '23.78', 'memory/device_reserved (GiB)': '30.28', 'tokens/train_per_sec_per_gpu': '336.1', 'tokens/trainable': 2818090, 'tokens/total': 6127762, 'epoch': '0.7706'} | |
| 77% 203/263 [2:36:28<43:45, 43.75s/it] {'loss': '0.6929', 'grad_norm': '0.2504', 'learning_rate': '3.095e-05', 'ppl': '2', 'memory/max_active (GiB)': '27.24', 'memory/max_allocated (GiB)': '27.24', 'memory/device_reserved (GiB)': '30.67', 'tokens/train_per_sec_per_gpu': '347.7', 'tokens/trainable': 2828664, 'tokens/total': 6151536, 'epoch': '0.7744'} | |
| 78% 204/263 [2:37:16<44:12, 44.96s/it] {'loss': '0.6381', 'grad_norm': '0.1846', 'learning_rate': '3e-05', 'ppl': '1.893', 'memory/max_active (GiB)': '35.73', 'memory/max_allocated (GiB)': '35.73', 'memory/device_reserved (GiB)': '45.1', 'tokens/train_per_sec_per_gpu': '335.6', 'tokens/trainable': 2844697, 'tokens/total': 6186540, 'epoch': '0.7783'} | |
| 78% 205/263 [2:38:09<45:46, 47.35s/it] {'loss': '0.578', 'grad_norm': '0.1669', 'learning_rate': '2.906e-05', 'ppl': '1.783', 'memory/max_active (GiB)': '38.6', 'memory/max_allocated (GiB)': '38.6', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '300.8', 'tokens/trainable': 2860615, 'tokens/total': 6225342, 'epoch': '0.7821'} | |
| 78% 206/263 [2:38:55<44:36, 46.95s/it] {'loss': '0.7123', 'grad_norm': '0.2747', 'learning_rate': '2.813e-05', 'ppl': '2.039', 'memory/max_active (GiB)': '33.82', 'memory/max_allocated (GiB)': '33.82', 'memory/device_reserved (GiB)': '53.29', 'tokens/train_per_sec_per_gpu': '214.9', 'tokens/trainable': 2870504, 'tokens/total': 6249862, 'epoch': '0.7859'} | |
| 79% 207/263 [2:39:38<42:35, 45.64s/it] {'loss': '0.5941', 'grad_norm': '0.198', 'learning_rate': '2.721e-05', 'ppl': '1.811', 'memory/max_active (GiB)': '28.76', 'memory/max_allocated (GiB)': '28.76', 'memory/device_reserved (GiB)': '42.35', 'tokens/train_per_sec_per_gpu': '301.9', 'tokens/trainable': 2883356, 'tokens/total': 6278118, 'epoch': '0.7897'} | |
| 79% 208/263 [2:40:17<40:11, 43.85s/it] {'loss': '0.7291', 'grad_norm': '0.1958', 'learning_rate': '2.631e-05', 'ppl': '2.073', 'memory/max_active (GiB)': '29.79', 'memory/max_allocated (GiB)': '29.79', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '300.4', 'tokens/trainable': 2895280, 'tokens/total': 6304026, 'epoch': '0.7935'} | |
| 79%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 208/263 [2:40:17<40:11, 43.85s/it] 79%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 209/263 [2:40:56<38:06, 42.35s/it] {'loss': '0.6712', 'grad_norm': '0.209', 'learning_rate': '2.542e-05', 'ppl': '1.957', 'memory/max_active (GiB)': '28.91', 'memory/max_allocated (GiB)': '28.91', 'memory/device_reserved (GiB)': '37.99', 'tokens/train_per_sec_per_gpu': '305.4', 'tokens/trainable': 2907146, 'tokens/total': 6329922, 'epoch': '0.7973'} | |
| 79%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 209/263 [2:40:56<38:06, 42.35s/it] 80%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 210/263 [2:41:38<37:14, 42.17s/it] {'loss': '0.6474', 'grad_norm': '0.2265', 'learning_rate': '2.454e-05', 'ppl': '1.911', 'memory/max_active (GiB)': '33.6', 'memory/max_allocated (GiB)': '33.6', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '257.8', 'tokens/trainable': 2917904, 'tokens/total': 6354510, 'epoch': '0.8011'} | |
| 80%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 210/263 [2:41:38<37:14, 42.17s/it] 80%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 211/263 [2:42:15<35:18, 40.74s/it] {'loss': '0.6219', 'grad_norm': '0.1866', 'learning_rate': '2.368e-05', 'ppl': '1.863', 'memory/max_active (GiB)': '23.98', 'memory/max_allocated (GiB)': '23.98', 'memory/device_reserved (GiB)': '42.13', 'tokens/train_per_sec_per_gpu': '303.4', 'tokens/trainable': 2929254, 'tokens/total': 6379050, 'epoch': '0.805'} | |
| 80%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 211/263 [2:42:15<35:18, 40.74s/it] 81%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 212/263 [2:43:06<37:06, 43.66s/it] {'loss': '0.5683', 'grad_norm': '0.1713', 'learning_rate': '2.283e-05', 'ppl': '1.765', 'memory/max_active (GiB)': '36.5', 'memory/max_allocated (GiB)': '36.5', 'memory/device_reserved (GiB)': '46.02', 'tokens/train_per_sec_per_gpu': '342.5', 'tokens/trainable': 2946538, 'tokens/total': 6416330, 'epoch': '0.8088'} | |
| 81%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 212/263 [2:43:06<37:06, 43.66s/it] 81%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 213/263 [2:44:05<40:19, 48.39s/it] {'loss': '0.6341', 'grad_norm': '0.1864', 'learning_rate': '2.199e-05', 'ppl': '1.885', 'memory/max_active (GiB)': '52.19', 'memory/max_allocated (GiB)': '52.19', 'memory/device_reserved (GiB)': '74.01', 'tokens/train_per_sec_per_gpu': '274.8', 'tokens/trainable': 2962875, 'tokens/total': 6452688, 'epoch': '0.8126'} | |
| 81%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 213/263 [2:44:05<40:19, 48.39s/it] 81%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 214/263 [2:44:59<40:41, 49.83s/it] {'loss': '0.6799', 'grad_norm': '0.2418', 'learning_rate': '2.117e-05', 'ppl': '1.974', 'memory/max_active (GiB)': '33.52', 'memory/max_allocated (GiB)': '33.52', 'memory/device_reserved (GiB)': '42', 'tokens/train_per_sec_per_gpu': '259.6', 'tokens/trainable': 2976675, 'tokens/total': 6485638, 'epoch': '0.8164'} | |
| 81%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 214/263 [2:44:59<40:41, 49.83s/it] 82%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 215/263 [2:45:35<36:33, 45.70s/it] {'loss': '0.5868', 'grad_norm': '0.2032', 'learning_rate': '2.036e-05', 'ppl': '1.798', 'memory/max_active (GiB)': '24.76', 'memory/max_allocated (GiB)': '24.76', 'memory/device_reserved (GiB)': '31.8', 'tokens/train_per_sec_per_gpu': '283.4', 'tokens/trainable': 2986893, 'tokens/total': 6509020, 'epoch': '0.8202'} | |
| 82%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 215/263 [2:45:35<36:33, 45.70s/it] 82%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 216/263 [2:46:18<35:20, 45.12s/it] {'loss': '0.7066', 'grad_norm': '0.2272', 'learning_rate': '1.957e-05', 'ppl': '2.027', 'memory/max_active (GiB)': '34.06', 'memory/max_allocated (GiB)': '34.06', 'memory/device_reserved (GiB)': '42.78', 'tokens/train_per_sec_per_gpu': '281.2', 'tokens/trainable': 2999202, 'tokens/total': 6537048, 'epoch': '0.824'} | |
| 82%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 216/263 [2:46:18<35:20, 45.12s/it] 83%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 217/263 [2:46:58<33:17, 43.41s/it] {'loss': '0.6148', 'grad_norm': '0.1853', 'learning_rate': '1.879e-05', 'ppl': '1.849', 'memory/max_active (GiB)': '27.75', 'memory/max_allocated (GiB)': '27.75', 'memory/device_reserved (GiB)': '36.31', 'tokens/train_per_sec_per_gpu': '309.6', 'tokens/trainable': 3011411, 'tokens/total': 6563400, 'epoch': '0.8278'} | |
| 83%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 217/263 [2:46:58<33:17, 43.41s/it] 83%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 218/263 [2:47:46<33:43, 44.97s/it] {'loss': '0.6522', 'grad_norm': '0.1957', 'learning_rate': '1.802e-05', 'ppl': '1.92', 'memory/max_active (GiB)': '31.19', 'memory/max_allocated (GiB)': '31.19', 'memory/device_reserved (GiB)': '41.37', 'tokens/train_per_sec_per_gpu': '279.6', 'tokens/trainable': 3024998, 'tokens/total': 6593488, 'epoch': '0.8317'} | |
| 83%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 218/263 [2:47:46<33:43, 44.97s/it] 83%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 219/263 [2:48:33<33:24, 45.57s/it] {'loss': '0.6208', 'grad_norm': '0.1843', 'learning_rate': '1.727e-05', 'ppl': '1.86', 'memory/max_active (GiB)': '30.24', 'memory/max_allocated (GiB)': '30.24', 'memory/device_reserved (GiB)': '39.91', 'tokens/train_per_sec_per_gpu': '332.7', 'tokens/trainable': 3040622, 'tokens/total': 6624140, 'epoch': '0.8355'} | |
| 83%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 219/263 [2:48:33<33:24, 45.57s/it] 84%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 220/263 [2:49:22<33:25, 46.63s/it] {'loss': '0.6097', 'grad_norm': '0.1874', 'learning_rate': '1.653e-05', 'ppl': '1.84', 'memory/max_active (GiB)': '37.68', 'memory/max_allocated (GiB)': '37.68', 'memory/device_reserved (GiB)': '47.76', 'tokens/train_per_sec_per_gpu': '277.7', 'tokens/trainable': 3054263, 'tokens/total': 6654518, 'epoch': '0.8393'} | |
| 84%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 220/263 [2:49:22<33:25, 46.63s/it] 84%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 221/263 [2:50:14<33:34, 47.97s/it] {'loss': '0.6886', 'grad_norm': '0.2007', 'learning_rate': '1.581e-05', 'ppl': '1.991', 'memory/max_active (GiB)': '29.37', 'memory/max_allocated (GiB)': '29.37', 'memory/device_reserved (GiB)': '38.64', 'tokens/train_per_sec_per_gpu': '249.6', 'tokens/trainable': 3067017, 'tokens/total': 6683614, 'epoch': '0.8431'} | |
| 84%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 221/263 [2:50:14<33:34, 47.97s/it] 84%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 222/263 [2:51:01<32:41, 47.83s/it] {'loss': '0.629', 'grad_norm': '0.2061', 'learning_rate': '1.51e-05', 'ppl': '1.876', 'memory/max_active (GiB)': '47.52', 'memory/max_allocated (GiB)': '47.52', 'memory/device_reserved (GiB)': '66.84', 'tokens/train_per_sec_per_gpu': '305', 'tokens/trainable': 3081510, 'tokens/total': 6718874, 'epoch': '0.8469'} | |
| 84%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 222/263 [2:51:01<32:41, 47.83s/it] 85%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 223/263 [2:51:45<31:07, 46.68s/it] {'loss': '0.6047', 'grad_norm': '0.1965', 'learning_rate': '1.441e-05', 'ppl': '1.831', 'memory/max_active (GiB)': '37.79', 'memory/max_allocated (GiB)': '37.79', 'memory/device_reserved (GiB)': '47.88', 'tokens/train_per_sec_per_gpu': '297.3', 'tokens/trainable': 3094585, 'tokens/total': 6747542, 'epoch': '0.8507'} | |
| 85%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 223/263 [2:51:45<31:07, 46.68s/it] 85%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 224/263 [2:52:53<34:29, 53.05s/it] {'loss': '0.6172', 'grad_norm': '0.1745', 'learning_rate': '1.373e-05', 'ppl': '1.854', 'memory/max_active (GiB)': '48.12', 'memory/max_allocated (GiB)': '48.12', 'memory/device_reserved (GiB)': '67.76', 'tokens/train_per_sec_per_gpu': '293.9', 'tokens/trainable': 3114552, 'tokens/total': 6793276, 'epoch': '0.8546'} | |
| 85%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 224/263 [2:52:53<34:29, 53.05s/it] 86%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 225/263 [2:53:48<33:52, 53.49s/it] {'loss': '0.6193', 'grad_norm': '0.2026', 'learning_rate': '1.307e-05', 'ppl': '1.858', 'memory/max_active (GiB)': '34.24', 'memory/max_allocated (GiB)': '34.24', 'memory/device_reserved (GiB)': '42.99', 'tokens/train_per_sec_per_gpu': '257.3', 'tokens/trainable': 3128578, 'tokens/total': 6823484, 'epoch': '0.8584'} | |
| 86%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 225/263 [2:53:48<33:52, 53.49s/it] 86%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 226/263 [2:54:34<31:37, 51.28s/it] {'loss': '0.6827', 'grad_norm': '0.2373', 'learning_rate': '1.242e-05', 'ppl': '1.979', 'memory/max_active (GiB)': '39.13', 'memory/max_allocated (GiB)': '39.13', 'memory/device_reserved (GiB)': '54.05', 'tokens/train_per_sec_per_gpu': '285.3', 'tokens/trainable': 3141735, 'tokens/total': 6855924, 'epoch': '0.8622'} | |
| 86%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 226/263 [2:54:34<31:37, 51.28s/it] 86%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 227/263 [2:55:15<29:03, 48.44s/it] {'loss': '0.6604', 'grad_norm': '0.1903', 'learning_rate': '1.179e-05', 'ppl': '1.935', 'memory/max_active (GiB)': '30.3', 'memory/max_allocated (GiB)': '30.3', 'memory/device_reserved (GiB)': '40.04', 'tokens/train_per_sec_per_gpu': '315.4', 'tokens/trainable': 3154923, 'tokens/total': 6884738, 'epoch': '0.866'} | |
| 86%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 227/263 [2:55:15<29:03, 48.44s/it] 87%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 228/263 [2:55:58<27:15, 46.74s/it] {'loss': '0.6514', 'grad_norm': '0.2107', 'learning_rate': '1.117e-05', 'ppl': '1.918', 'memory/max_active (GiB)': '22.9', 'memory/max_allocated (GiB)': '22.9', 'memory/device_reserved (GiB)': '34.14', 'tokens/train_per_sec_per_gpu': '252.7', 'tokens/trainable': 3165734, 'tokens/total': 6907742, 'epoch': '0.8698'} | |
| 87%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 228/263 [2:55:58<27:15, 46.74s/it] 87%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 229/263 [2:56:54<27:58, 49.37s/it] {'loss': '0.6451', 'grad_norm': '0.1984', 'learning_rate': '1.057e-05', 'ppl': '1.906', 'memory/max_active (GiB)': '43.49', 'memory/max_allocated (GiB)': '43.49', 'memory/device_reserved (GiB)': '60.71', 'tokens/train_per_sec_per_gpu': '272.3', 'tokens/trainable': 3180849, 'tokens/total': 6937996, 'epoch': '0.8736'} | |
| 87%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 229/263 [2:56:54<27:58, 49.37s/it] 87%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 230/263 [2:57:41<26:45, 48.65s/it] {'loss': '0.6366', 'grad_norm': '0.1864', 'learning_rate': '9.985e-06', 'ppl': '1.89', 'memory/max_active (GiB)': '35.46', 'memory/max_allocated (GiB)': '35.46', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '330', 'tokens/trainable': 3196343, 'tokens/total': 6970136, 'epoch': '0.8774'} | |
| 87%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 230/263 [2:57:41<26:45, 48.65s/it] 88%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 231/263 [2:58:23<24:51, 46.60s/it] {'loss': '0.6353', 'grad_norm': '0.1761', 'learning_rate': '9.416e-06', 'ppl': '1.888', 'memory/max_active (GiB)': '34.63', 'memory/max_allocated (GiB)': '34.63', 'memory/device_reserved (GiB)': '44.57', 'tokens/train_per_sec_per_gpu': '386.8', 'tokens/trainable': 3212523, 'tokens/total': 7006106, 'epoch': '0.8813'} | |
| 88%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 231/263 [2:58:23<24:51, 46.60s/it] 88%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 232/263 [2:59:15<24:55, 48.25s/it] {'loss': '0.5426', 'grad_norm': '0.1728', 'learning_rate': '8.862e-06', 'ppl': '1.721', 'memory/max_active (GiB)': '38.02', 'memory/max_allocated (GiB)': '38.02', 'memory/device_reserved (GiB)': '48.33', 'tokens/train_per_sec_per_gpu': '275.8', 'tokens/trainable': 3226889, 'tokens/total': 7039494, 'epoch': '0.8851'} | |
| 88%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 232/263 [2:59:15<24:55, 48.25s/it] 89%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 233/263 [2:59:59<23:35, 47.20s/it] {'loss': '0.6979', 'grad_norm': '0.2131', 'learning_rate': '8.325e-06', 'ppl': '2.01', 'memory/max_active (GiB)': '35.02', 'memory/max_allocated (GiB)': '35.02', 'memory/device_reserved (GiB)': '44.24', 'tokens/train_per_sec_per_gpu': '322.7', 'tokens/trainable': 3241328, 'tokens/total': 7070424, 'epoch': '0.8889'} | |
| 89%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 233/263 [2:59:59<23:35, 47.20s/it] 89%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 234/263 [3:00:41<22:00, 45.55s/it] {'loss': '0.6724', 'grad_norm': '0.2417', 'learning_rate': '7.803e-06', 'ppl': '1.959', 'memory/max_active (GiB)': '28.09', 'memory/max_allocated (GiB)': '28.09', 'memory/device_reserved (GiB)': '36.8', 'tokens/train_per_sec_per_gpu': '255', 'tokens/trainable': 3251965, 'tokens/total': 7093680, 'epoch': '0.8927'} | |
| 89%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 234/263 [3:00:41<22:00, 45.55s/it] 89%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 235/263 [3:01:27<21:14, 45.53s/it] {'loss': '0.6359', 'grad_norm': '0.1907', 'learning_rate': '7.298e-06', 'ppl': '1.889', 'memory/max_active (GiB)': '36.62', 'memory/max_allocated (GiB)': '36.62', 'memory/device_reserved (GiB)': '46.28', 'tokens/train_per_sec_per_gpu': '325.9', 'tokens/trainable': 3266781, 'tokens/total': 7123902, 'epoch': '0.8965'} | |
| 89%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 235/263 [3:01:27<21:14, 45.53s/it] 90%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 236/263 [3:02:15<20:50, 46.32s/it] {'loss': '0.6866', 'grad_norm': '0.202', 'learning_rate': '6.809e-06', 'ppl': '1.987', 'memory/max_active (GiB)': '38.62', 'memory/max_allocated (GiB)': '38.62', 'memory/device_reserved (GiB)': '53.25', 'tokens/train_per_sec_per_gpu': '298.6', 'tokens/trainable': 3281164, 'tokens/total': 7156358, 'epoch': '0.9003'} | |
| 90%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 236/263 [3:02:15<20:50, 46.32s/it] 90%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 237/263 [3:02:56<19:22, 44.71s/it] {'loss': '0.6841', 'grad_norm': '0.2104', 'learning_rate': '6.337e-06', 'ppl': '1.982', 'memory/max_active (GiB)': '36.98', 'memory/max_allocated (GiB)': '36.98', 'memory/device_reserved (GiB)': '46.64', 'tokens/train_per_sec_per_gpu': '374.6', 'tokens/trainable': 3296514, 'tokens/total': 7188274, 'epoch': '0.9041'} | |
| 90%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 237/263 [3:02:56<19:22, 44.71s/it] 90%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 238/263 [3:03:46<19:19, 46.39s/it] {'loss': '0.6678', 'grad_norm': '0.1991', 'learning_rate': '5.881e-06', 'ppl': '1.95', 'memory/max_active (GiB)': '58.32', 'memory/max_allocated (GiB)': '58.32', 'memory/device_reserved (GiB)': '75.86', 'tokens/train_per_sec_per_gpu': '327.5', 'tokens/trainable': 3312986, 'tokens/total': 7224448, 'epoch': '0.908'} | |
| 90%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 238/263 [3:03:46<19:19, 46.39s/it] 91%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 239/263 [3:04:33<18:35, 46.48s/it] {'loss': '0.6174', 'grad_norm': '0.1702', 'learning_rate': '5.441e-06', 'ppl': '1.854', 'memory/max_active (GiB)': '31.28', 'memory/max_allocated (GiB)': '31.28', 'memory/device_reserved (GiB)': '41.57', 'tokens/train_per_sec_per_gpu': '353.2', 'tokens/trainable': 3329480, 'tokens/total': 7253234, 'epoch': '0.9118'} | |
| 91%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 239/263 [3:04:33<18:35, 46.48s/it] 91%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 240/263 [3:05:26<18:37, 48.60s/it] {'loss': '0.6532', 'grad_norm': '0.1679', 'learning_rate': '5.018e-06', 'ppl': '1.922', 'memory/max_active (GiB)': '34.53', 'memory/max_allocated (GiB)': '34.53', 'memory/device_reserved (GiB)': '43.5', 'tokens/train_per_sec_per_gpu': '319.1', 'tokens/trainable': 3346561, 'tokens/total': 7292512, 'epoch': '0.9156'} | |
| 91%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 240/263 [3:05:26<18:37, 48.60s/it] 92%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 241/263 [3:06:09<17:10, 46.84s/it] {'loss': '0.6461', 'grad_norm': '0.2172', 'learning_rate': '4.612e-06', 'ppl': '1.908', 'memory/max_active (GiB)': '30.52', 'memory/max_allocated (GiB)': '30.52', 'memory/device_reserved (GiB)': '40.36', 'tokens/train_per_sec_per_gpu': '270.1', 'tokens/trainable': 3358110, 'tokens/total': 7319476, 'epoch': '0.9194'} | |
| 92%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 241/263 [3:06:09<17:10, 46.84s/it] 92%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 242/263 [3:07:13<18:14, 52.12s/it] {'loss': '0.6017', 'grad_norm': '0.1813', 'learning_rate': '4.222e-06', 'ppl': '1.825', 'memory/max_active (GiB)': '46.95', 'memory/max_allocated (GiB)': '46.95', 'memory/device_reserved (GiB)': '66.02', 'tokens/train_per_sec_per_gpu': '347.9', 'tokens/trainable': 3380518, 'tokens/total': 7369626, 'epoch': '0.9232'} | |
| 92%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 242/263 [3:07:13<18:14, 52.12s/it] 92%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 243/263 [3:08:11<17:54, 53.72s/it] {'loss': '0.5814', 'grad_norm': '0.1749', 'learning_rate': '3.85e-06', 'ppl': '1.789', 'memory/max_active (GiB)': '39.53', 'memory/max_allocated (GiB)': '39.53', 'memory/device_reserved (GiB)': '58.11', 'tokens/train_per_sec_per_gpu': '306.3', 'tokens/trainable': 3398118, 'tokens/total': 7406506, 'epoch': '0.927'} | |
| 92%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 243/263 [3:08:11<17:54, 53.72s/it] 93%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 244/263 [3:08:58<16:25, 51.87s/it] {'loss': '0.632', 'grad_norm': '0.1955', 'learning_rate': '3.494e-06', 'ppl': '1.881', 'memory/max_active (GiB)': '49.89', 'memory/max_allocated (GiB)': '49.89', 'memory/device_reserved (GiB)': '70.63', 'tokens/train_per_sec_per_gpu': '358.4', 'tokens/trainable': 3415167, 'tokens/total': 7447132, 'epoch': '0.9309'} | |
| 93%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 244/263 [3:08:58<16:25, 51.87s/it] 93%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 245/263 [3:09:42<14:46, 49.25s/it] {'loss': '0.683', 'grad_norm': '0.2159', 'learning_rate': '3.155e-06', 'ppl': '1.98', 'memory/max_active (GiB)': '31.1', 'memory/max_allocated (GiB)': '31.1', 'memory/device_reserved (GiB)': '41.18', 'tokens/train_per_sec_per_gpu': '306.6', 'tokens/trainable': 3428390, 'tokens/total': 7476600, 'epoch': '0.9347'} | |
| 93%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 245/263 [3:09:42<14:46, 49.25s/it] 94%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 246/263 [3:11:01<16:32, 58.35s/it] {'loss': '0.6627', 'grad_norm': '0.1617', 'learning_rate': '2.833e-06', 'ppl': '1.94', 'memory/max_active (GiB)': '57.27', 'memory/max_allocated (GiB)': '57.27', 'memory/device_reserved (GiB)': '74.5', 'tokens/train_per_sec_per_gpu': '342.7', 'tokens/trainable': 3455668, 'tokens/total': 7533052, 'epoch': '0.9385'} | |
| 94%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 246/263 [3:11:01<16:32, 58.35s/it] 94%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 247/263 [3:11:31<13:18, 49.90s/it] {'loss': '0.6176', 'grad_norm': '0.1991', 'learning_rate': '2.528e-06', 'ppl': '1.855', 'memory/max_active (GiB)': '27.2', 'memory/max_allocated (GiB)': '27.2', 'memory/device_reserved (GiB)': '35.37', 'tokens/train_per_sec_per_gpu': '385.7', 'tokens/trainable': 3467302, 'tokens/total': 7557486, 'epoch': '0.9423'} | |
| 94%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 247/263 [3:11:31<13:18, 49.90s/it] 94%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 248/263 [3:12:15<11:59, 47.94s/it] {'loss': '0.6939', 'grad_norm': '0.2216', 'learning_rate': '2.241e-06', 'ppl': '2.002', 'memory/max_active (GiB)': '44.94', 'memory/max_allocated (GiB)': '44.94', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '307.3', 'tokens/trainable': 3480626, 'tokens/total': 7588958, 'epoch': '0.9461'} | |
| 94%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 248/263 [3:12:15<11:59, 47.94s/it] 95%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 249/263 [3:13:08<11:32, 49.44s/it] {'loss': '0.6791', 'grad_norm': '0.2064', 'learning_rate': '1.97e-06', 'ppl': '1.972', 'memory/max_active (GiB)': '39.77', 'memory/max_allocated (GiB)': '39.77', 'memory/device_reserved (GiB)': '62.89', 'tokens/train_per_sec_per_gpu': '323.8', 'tokens/trainable': 3497767, 'tokens/total': 7623894, 'epoch': '0.9499'} | |
| 95%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 249/263 [3:13:08<11:32, 49.44s/it] 95%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 250/263 [3:14:15<11:50, 54.69s/it] {'loss': '0.6591', 'grad_norm': '0.2037', 'learning_rate': '1.717e-06', 'ppl': '1.933', 'memory/max_active (GiB)': '55.75', 'memory/max_allocated (GiB)': '55.75', 'memory/device_reserved (GiB)': '72.33', 'tokens/train_per_sec_per_gpu': '270.7', 'tokens/trainable': 3515893, 'tokens/total': 7665338, 'epoch': '0.9537'} | |
| 95% 251/263 [3:14:50<09:48, 49.05s/it] {'loss': '0.6294', 'grad_norm': '0.2187', 'learning_rate': '1.481e-06', 'ppl': '1.876', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '32.43', 'tokens/train_per_sec_per_gpu': '270.5', 'tokens/trainable': 3525599, 'tokens/total': 7687250, 'epoch': '0.9576'} | |
| 96% 252/263 [3:15:41<09:04, 49.48s/it] {'loss': '0.545', 'grad_norm': '0.1694', 'learning_rate': '1.262e-06', 'ppl': '1.725', 'memory/max_active (GiB)': '48.1', 'memory/max_allocated (GiB)': '48.1', 'memory/device_reserved (GiB)': '67.93', 'tokens/train_per_sec_per_gpu': '384.7', 'tokens/trainable': 3545027, 'tokens/total': 7722636, 'epoch': '0.9614'} | |
| 96% 253/263 [3:16:51<09:16, 55.62s/it] {'loss': '0.5997', 'grad_norm': '0.1826', 'learning_rate': '1.061e-06', 'ppl': '1.822', 'memory/max_active (GiB)': '50.46', 'memory/max_allocated (GiB)': '50.46', 'memory/device_reserved (GiB)': '71.27', 'tokens/train_per_sec_per_gpu': '251.7', 'tokens/trainable': 3562629, 'tokens/total': 7766154, 'epoch': '0.9652'} | |
| 97% 254/263 [3:17:55<08:42, 58.06s/it] {'loss': '0.6639', 'grad_norm': '0.187', 'learning_rate': '8.773e-07', 'ppl': '1.942', 'memory/max_active (GiB)': '51.43', 'memory/max_allocated (GiB)': '51.43', 'memory/device_reserved (GiB)': '72.87', 'tokens/train_per_sec_per_gpu': '309.8', 'tokens/trainable': 3582377, 'tokens/total': 7807646, 'epoch': '0.969'} | |
| 97% 255/263 [3:18:36<07:04, 53.09s/it] {'loss': '0.7493', 'grad_norm': '0.2242', 'learning_rate': '7.108e-07', 'ppl': '2.116', 'memory/max_active (GiB)': '34.97', 'memory/max_allocated (GiB)': '34.97', 'memory/device_reserved (GiB)': '44.01', 'tokens/train_per_sec_per_gpu': '305.2', 'tokens/trainable': 3595039, 'tokens/total': 7833474, 'epoch': '0.9728'} | |
| 97% 256/263 [3:19:23<05:58, 51.28s/it] {'loss': '0.6764', 'grad_norm': '0.2359', 'learning_rate': '5.618e-07', 'ppl': '1.967', 'memory/max_active (GiB)': '21.36', 'memory/max_allocated (GiB)': '21.36', 'memory/device_reserved (GiB)': '26.67', 'tokens/train_per_sec_per_gpu': '177.1', 'tokens/trainable': 3603378, 'tokens/total': 7853228, 'epoch': '0.9766'} | |
| 98% 257/263 [3:20:16<05:10, 51.73s/it] {'loss': '0.5852', 'grad_norm': '0.1929', 'learning_rate': '4.302e-07', 'ppl': '1.795', 'memory/max_active (GiB)': '25.94', 'memory/max_allocated (GiB)': '25.94', 'memory/device_reserved (GiB)': '33.87', 'tokens/train_per_sec_per_gpu': '263.3', 'tokens/trainable': 3617272, 'tokens/total': 7882296, 'epoch': '0.9804'} | |
| 98% 258/263 [3:21:05<04:14, 50.88s/it] {'loss': '0.709', 'grad_norm': '0.2294', 'learning_rate': '3.161e-07', 'ppl': '2.032', 'memory/max_active (GiB)': '46.35', 'memory/max_allocated (GiB)': '46.35', 'memory/device_reserved (GiB)': '65.2', 'tokens/train_per_sec_per_gpu': '308.2', 'tokens/trainable': 3632340, 'tokens/total': 7915626, 'epoch': '0.9843'} | |
| 98% 259/263 [3:21:34<02:57, 44.37s/it] {'loss': '0.5761', 'grad_norm': '0.2068', 'learning_rate': '2.196e-07', 'ppl': '1.779', 'memory/max_active (GiB)': '24.15', 'memory/max_allocated (GiB)': '24.15', 'memory/device_reserved (GiB)': '39.5', 'tokens/train_per_sec_per_gpu': '365.9', 'tokens/trainable': 3643011, 'tokens/total': 7937480, 'epoch': '0.9881'} | |
| 99% 260/263 [3:22:14<02:09, 43.19s/it] {'loss': '0.654', 'grad_norm': '0.1978', 'learning_rate': '1.405e-07', 'ppl': '1.923', 'memory/max_active (GiB)': '29.8', 'memory/max_allocated (GiB)': '29.8', 'memory/device_reserved (GiB)': '39.24', 'tokens/train_per_sec_per_gpu': '340.3', 'tokens/trainable': 3656769, 'tokens/total': 7966668, 'epoch': '0.9919'} | |
| 99% 261/263 [3:23:10<01:34, 47.04s/it] {'loss': '0.6804', 'grad_norm': '0.2136', 'learning_rate': '7.906e-08', 'ppl': '1.975', 'memory/max_active (GiB)': '32.49', 'memory/max_allocated (GiB)': '32.49', 'memory/device_reserved (GiB)': '41.61', 'tokens/train_per_sec_per_gpu': '282.8', 'tokens/trainable': 3672618, 'tokens/total': 8001206, 'epoch': '0.9957'} | |
| 100% 262/263 [3:24:03<00:48, 48.74s/it] {'loss': '0.5862', 'grad_norm': '0.2206', 'learning_rate': '3.514e-08', 'ppl': '1.797', 'memory/max_active (GiB)': '25.21', 'memory/max_allocated (GiB)': '25.21', 'memory/device_reserved (GiB)': '40.98', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3683361, 'tokens/total': 8024868, 'epoch': '0.9995'} | |
| 100% 263/263 [3:24:08<00:00, 35.60s/it] {'loss': '0.6559', 'grad_norm': '0.6503', 'learning_rate': '8.786e-09', 'ppl': '1.927', 'memory/max_active (GiB)': '17.39', 'memory/max_allocated (GiB)': '17.39', 'memory/device_reserved (GiB)': '32.52', 'tokens/train_per_sec_per_gpu': '203.9', 'tokens/trainable': 3684370, 'tokens/total': 8026968, 'epoch': '1'} | |
| [2026-06-14 17:37:06,485] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B/checkpoint-263 | |
| {'train_runtime': '1.225e+04', 'train_samples_per_second': '0.343', 'train_steps_per_second': '0.021', 'train_loss': '0.731', 'memory/max_active (GiB)': '11.01', 'memory/max_allocated (GiB)': '11.01', 'memory/device_reserved (GiB)': '20.05', 'epoch': '1'} | |
| 100% 263/263 [3:24:11<00:00, 46.58s/it] | |
| [2026-06-14 17:37:12,240] [INFO] [axolotl.train.save_trained_model:267] [PID:3393] Training completed! Saving trained model to ./outputs/Jacob-2-E4B. | |
| [2026-06-14 17:37:12,857] [INFO] [axolotl.train.save_trained_model:388] [PID:3393] Model successfully saved to ./outputs/Jacob-2-E4B | |
| [2026-06-14 17:37:13,576] [INFO] [axolotl.core.trainers.base._save:846] [PID:3393] Saving model checkpoint to ./outputs/Jacob-2-E4B | |
| Processing Files (3 / 3): 100% 172MB / 172MB, 109MB/s | |
| New Data Upload: 0.00B / 0.00B, 0.00B/s | |
| ...b-2-E4B/training_args.bin: 100% 24.2kB / 24.2kB | |
| ...adapter_model.safetensors: 100% 140MB / 140MB | |
| ...acob-2-E4B/tokenizer.json: 100% 32.2MB / 32.2MB |
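
The `ppl` column in the log appears to be the exponential of the logged `loss` (perplexity for a mean cross-entropy in nats), and the summary throughput appears to follow from the step count and `train_runtime`. The sketch below is a minimal sanity check of both, assuming that interpretation. The values are copied from the log lines above; the helper name `perplexity` and the tolerance are illustrative choices, not part of axolotl.

```python
import math

# (loss, ppl) pairs copied from steps 251, 252, 253 and 263 of the log.
logged = [
    (0.6294, 1.876),
    (0.545, 1.725),
    (0.5997, 1.822),
    (0.6559, 1.927),
]


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


for loss, ppl in logged:
    # Both columns are rounded in the log, so compare with a small tolerance.
    assert abs(perplexity(loss) - ppl) < 2e-3, (loss, ppl)

# Summary line: 263 steps over train_runtime = 1.225e+04 seconds.
train_runtime_s = 1.225e4
steps = 263
steps_per_second = steps / train_runtime_s
print(f"{steps_per_second:.3f}")  # compare with 'train_steps_per_second' in the log
```

Under these assumptions the recomputed values agree with the logged `ppl` and `train_steps_per_second` fields to the precision shown in the log.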